Abstract

The idea that many important classes of signals can be well-represented by linear combinations
of a small set of atoms selected from a given dictionary has had dramatic impact on the theory
and practice of signal processing. For practical problems in which an appropriate sparsifying
dictionary is not known ahead of time, a very popular and successful heuristic is to search for
a dictionary that minimizes an appropriate sparsity surrogate over a given set of sample data.
While this idea is appealing, the behavior of these algorithms is largely a mystery; although
there is a body of empirical evidence suggesting they do learn very effective representations,
there is little theory to guarantee when they will behave correctly, or when the learned
dictionary can be expected to generalize. In this paper, we take a step towards such a theory.
We show that under mild hypotheses, the dictionary learning problem is locally well-posed: the
desired solution is indeed a local minimum of the $\ell^1$ norm. Namely, if
$A \in \mathbb{R}^{m \times n}$ is an incoherent (and possibly overcomplete) dictionary, and the
coefficients $X \in \mathbb{R}^{n \times p}$ follow a random sparse model, then with high
probability $(A, X)$ is a local minimum of the $\ell^1$ norm over the manifold of factorizations
$(A', X')$ satisfying $A'X' = Y$, provided the number of samples $p = \Omega(n^3 k)$. For
overcomplete $A$, this is the first result showing that the dictionary learning problem is
locally solvable. Our analysis draws on tools developed for the problem of completing a low-rank
matrix from a small subset of its entries, which allow us to overcome a number of technical
obstacles; in particular, the absence of the restricted isometry property.



1 Introduction 

To a great extent, progress in signal processing over the past four decades has been driven by
the quest for ever more effective signal representations. The development of increasingly
powerful, relevant representations for natural images, from Fourier and DCT bases [ANR74] to
wavelets [MG84], curvelets [CDDY06] and beyond, has significantly enriched our understanding of
the structure of images, and has also spurred the development of influential practical coding
standards [Wal91]. Because of this, hand design of signal representations has been a dominant
paradigm in signal processing and applied mathematics. Indeed, it is difficult to overstate the
intellectual and practical impact of this quest.

However, there are voices of dissent. One competing train of thought, dating at least back to
the advent of the Karhunen-Loeve transform in the 1970's, suggests that rather than meticulously
designing an appropriate representation for each class of signals we encounter, it may be
possible to simply learn an appropriate representation from large sets of sample data. This idea
has several appeals: Given the recent proliferation of new and exotic types of data (images,
videos, web and bioinformatic data, etc.), it may not be possible to invest the intellectual
effort required to develop optimal representations for each new class of signal we encounter. At
the same time, data are becoming increasingly high-dimensional, a fact which stretches the
limitations of our human intuition, potentially limiting our ability to develop effective data
representations. It may be possible for an automatic procedure to discover useful structure in
the data that is not readily apparent to us.

^This work was partially performed while Q. Geng and H. Wang were interns in the Visual Computing
Group at Microsoft Research Asia. The authors would like to thank Dan Spielman of Yale University
and Yi Ma of MSRA for helpful discussions.

Spurred by this promise, researchers have invested a great amount of effort in developing
algorithms that can automatically derive good representations for sample data. In particular,
much recent effort has been focused on sparse linear representations. A signal
$y \in \mathbb{R}^m$ is said to have a sparse representation in terms of a given dictionary of
basis signals $A = [A_1, \dots, A_n] \in \mathbb{R}^{m \times n}$ if $y \approx Ax$, where
$x \in \mathbb{R}^n$ is a coefficient vector with only a few nonzero entries
($k = \|x\|_0 \ll n$). This notion of sparsity has emerged in the past 10 years as a dominant
idea in signal processing [BDE09]. This is due both to the ubiquity of sparsity (or
near-sparsity) in practical problems, as well as a line of fundamental theoretical results
[DE03, Fuc04, CT05, ZY06, MY09] that assert that if $y$ is known to be sparse in a known basis
$A$ satisfying certain technical conditions, the sparse coefficients $x_0$ can be very accurately
estimated (sometimes perfectly so!) by solving an $\ell^1$ minimization problem:

minimize $\|x\|_1$ subject to $y = Ax$. (1.1)

These theoretical results allow us to deploy tools from sparse signal representation with great
confidence: if the signal $y$ has a sparse representation, then efficient algorithms are
guaranteed to recover it.

When facing a new class of signals, however, it is not clear how to begin: what basis $A$ might
allow typical signals $y$ to be sparsely represented? A popular heuristic is to search for a
basis $A$ that allows a given set of examples $Y = [y_1, \dots, y_p] \in \mathbb{R}^{m \times p}$
to be represented as compactly as possible. That is, we attempt to solve the following model
problem, often referred to as "dictionary learning":

Given samples $Y = [y_1, \dots, y_p] \in \mathbb{R}^{m \times p}$, all of which can be sparsely
represented in terms of some unknown dictionary $A$ ($Y = AX$, for some $X$ with sparse
columns), recover $A$.

A number of algorithms have been proposed for this problem [OF96, EAHH99, KDMR+03, AEB06,
MBPS10] (see the survey [RBE10] for a more thorough review). Exploiting sparsity in learned
dictionaries has led to practical success in a number of important problems in signal
acquisition and processing [EA06, BE08, MBP+08, RS08, YWHM10]. On the other hand, relatively
little theory is available to explain when and why dictionary learning algorithms succeed. There
is also little in the way of guidelines to tell practitioners when the learned dictionary is
expected to generalize beyond the given sample set $Y$. This stands in contrast to the situation
with hand-designed dictionaries, which often come with proofs of optimality for important
classes of signals.

In this paper, we take a step towards closing this gap. We study a model optimization approach 
to dictionary learning: 

minimize $\|X\|_1$ subject to $Y = AX$, $\|A_i\|_2 = 1 \;\forall i$. (1.2)

Here, $\|\cdot\|_1$ denotes the sum of magnitudes, $\|X\|_1 = \sum_{ij} |X_{ij}|$. This
optimization problem was first studied by Gribonval and Schnass [GS10], as a natural abstraction
of popular dictionary learning algorithms (we will discuss the results of [GS10] in more detail
in Section 2.1). Notice that while the objective function in (1.2) is convex, the constraint is
not. Hence, in general it may seem that all we can hope for is a local optimum. This is a common
feature of dictionary learning algorithms. Indeed, it is a classical observation in source
separation that if we take a permutation matrix $\Pi \in \mathbb{R}^{n \times n}$ and a diagonal
matrix of signs $\Sigma$, then whenever $(A, X)$ solves the above problem, so does
$(A\Pi\Sigma, \Sigma\Pi^* X)$. This "sign-permutation ambiguity" implies that corresponding to
every local minimum of (1.2), there is a class of $2^n n!$ equivalent solutions. Moreover, a
priori there is nothing to prevent the existence of exponentially large classes of local minima.
This might lead one to a dispiriting conclusion: "the problem (1.2) is impossible to solve in
general; moreover, nothing rigorous can be said about its solution."

Figure 1: Phase transitions in dictionary recovery? We test whether locally minimizing the
$\ell^1$ norm correctly recovers the dictionary $A \in \mathbb{R}^{m \times n}$ and sparse
coefficients $X \in \mathbb{R}^{n \times p}$, for varying sparsity levels $k$ and problem size
$n$. Left: $m = n$. Middle: $m = 0.8 \times n$. Right: $m = 0.6 \times n$. Here,
$p = 5n\log(n)$. Trials are judged successful if the relative error
$\|\hat{A} - A\|_F / \|A\|_F$ in the recovered $\hat{A}$ is smaller than 10~^. We average over 10
trials; white corresponds to success in all trials, black to failure in all trials. The problems
are solved using an algorithm outlined in [GWW11].
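The sign-permutation ambiguity of (1.2) is easy to check numerically. The following is a minimal NumPy sketch, not taken from the paper; the problem sizes, the random seed, and the variable names (`Pi`, `Sigma`, `A2`, `X2`) are illustrative assumptions. It builds a random unit-norm dictionary, applies a random permutation and sign flip, and confirms that the transformed pair is still feasible for (1.2) with the same objective value.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 6, 8, 20  # hypothetical small sizes, chosen only for illustration

# Dictionary with unit-norm columns and a sparse-ish coefficient matrix.
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)
X = rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.25)

# Random permutation matrix Pi and diagonal sign matrix Sigma.
Pi = np.eye(n)[:, rng.permutation(n)]
Sigma = np.diag(rng.choice([-1.0, 1.0], size=n))

A2 = A @ Pi @ Sigma      # A Pi Sigma
X2 = Sigma @ Pi.T @ X    # Sigma Pi^* X

# The product Y, the l1 objective, and the unit-norm constraint are all preserved.
same_product = np.allclose(A2 @ X2, A @ X)
same_objective = np.isclose(np.abs(X2).sum(), np.abs(X).sum())
unit_columns = np.allclose(np.linalg.norm(A2, axis=0), 1.0)
print(same_product, same_objective, unit_columns)
```

Since any of the $n!$ permutations can be combined with any of the $2^n$ sign patterns, this accounts for the $2^n n!$ equivalent solutions mentioned above.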

Part of the goal of this paper is to dispel such pessimism. Figure 1 shows why there might be
reason for hope. In it, we solve various synthetic instances of the problem (1.2), with varying
problem size and sparsity level. The figure plots fractions of correct recoveries, for various
aspect ratios $m/n$ of $A \in \mathbb{R}^{m \times n}$. We observe a very intriguing phenomenon:

Empirically, optimization algorithms for dictionary learning succeed when the problem is
well-structured ($X$ is sufficiently sparse), and fail otherwise. Moreover, in simulated
examples, the transition between these two modes of operation is fairly sharp.

This suggests that, similar to the results for $\ell^1$-minimization discussed above, there are important
classes of dictionary learning problems that can be solved exactly by efficient (polynomial time) 
algorithms. 

Fully understanding this phenomenon is a long-term goal. Although local optimization approaches
to dictionary learning have repeatedly demonstrated good empirical behavior, the aforementioned
difficulties of non-convexity and sign-permutation ambiguity raise significant technical
obstacles to developing a theory of their correctness. Nevertheless, a step in this direction
was taken by Gribonval and Schnass [GS10], who showed that if $A$ is square ($m = n$), then for
certain random coefficient models, the desired solution is indeed a local minimum of the
$\ell^1$-norm with high probability. In this paper, we show that this is true for a wider range
of matrices, including overcomplete dictionaries $A$ with more columns than rows. We prove:

If the matrix A is appropriately incoherent and the coefficients X are drawn from a 
random sparsity model, then after seeing polynomially many samples (say, n{n^)), with 
high probability the desired solution is indeed locally recoverable. 

For non-square matrices, this is the first result suggesting that correct recovery is possible
by $\ell^1$ minimization, even locally. Establishing it seems to demand a different set of
technical tools and ideas from [GS10]. We will see that understanding the local properties of
(1.2) essentially requires us to study a certain equality-constrained $\ell^1$ norm minimization
problem, which arises by linearizing the nonlinear constraint in (1.2) at the desired solution
$(A_\star, X_\star)$. While $\ell^1$ norm minimization has been widely studied, and its
correctness for recovering sparse representations in known bases (i.e., problem (1.1)) is
increasingly well-understood, the particular $\ell^1$ minimization problem encountered in
dictionary learning raises new challenges. In particular, we will see that the linear
constraints in this problem do not satisfy the Restricted Isometry Property (RIP) [CT05], a fact
which significantly complicates their analysis. We are instead inspired by an analogy to the
problem of completing a low-rank matrix from an observation consisting of a small subset of its
entries [CR08], another problem in which the RIP (and analogues) fails. In particular, our
analysis is inspired by the golfing scheme of David Gross [Gro09], which has proved useful for a
variety of problems where the RIP is absent [CLMW09, CP10]. We also make heavy use of the
convenient and powerful operator Chernoff bounds of Joel Tropp [Tro10], whose work builds on an
approach introduced by Ahlswede and Winter [AW02].



1.1 Organization 

This paper is organized as follows. In Section 2, we describe in greater detail the model
studied here, formally state our main result, and discuss its implications. In particular, in
Section 2.1 we discuss its relationship to existing results. The remainder of the paper
comprises a proof of this result. Section 3 develops optimality conditions, phrased in terms of
the existence of a certain dual certificate. In Section 4, we construct this dual certificate.
The success of the construction relies on a certain balancedness property of the linearized
subproblem at the optimum; we formally state and prove this property in Section 5.



1.2 Notation 

For matrices, $X^*$ will denote the transpose of $X$. $\|X\|$ will denote the $\ell^2$ operator
norm. $\|X\|_F = \sqrt{\mathrm{tr}[X^* X]}$ will denote the Frobenius norm. By slight abuse of
notation, $\|X\|_1$ and $\|X\|_\infty$ will denote the $\ell^1$ and $\ell^\infty$ norms of the
matrix, viewed as a large vector:

$\|X\|_1 = \sum_{ij} |X_{ij}|$, $\|X\|_\infty = \max_{ij} |X_{ij}|$. (1.3)

For vectors $x$, the notation $\|x\|$ will mean the $\ell^2$ norm $\sqrt{x^* x}$. $\|x\|_1$ and
$\|x\|_\infty$ will denote the usual $\ell^1$ and $\ell^\infty$ norms, respectively. $[n]$
denotes the first $n$ positive integers, $\{1, \dots, n\}$. The symbols $e_1, \dots, e_d$ will
denote the standard basis vectors for $\mathbb{R}^d$; their dimension will be clear from
context. Throughout, the symbols $C_1, C_2, \dots, c_1, c_2, \dots$ refer to numerical
constants. When used in different sections, they need not refer to the same constant. For a
linear subspace $V \subseteq \mathbb{R}^d$, we will let $P_V \in \mathbb{R}^{d \times d}$ denote
the projection matrix onto $V$. For a linear subspace $V$ contained in a more general linear
space (say, $V \subseteq \mathbb{R}^{d \times d'}$), we will let $\mathcal{P}_V$ denote the
projection operator onto this space. We will slightly abuse notation, and define, for
$I \subseteq [d]$, $P_I$ to be the projection matrix onto the subspace of vectors supported on
$I$; similarly, for $\Omega \subseteq [d] \times [d']$,
$\mathcal{P}_\Omega : \mathbb{R}^{d \times d'} \to \mathbb{R}^{d \times d'}$ will denote the
projection operator onto $\Omega$, which retains the entries indexed by $\Omega$, and sets the
rest to zero. As usual, $A \otimes B$ denotes the Kronecker product between matrices $A$ and
$B$. For $B \in \mathbb{R}^{c \times d}$, $\mathrm{vec}[B] \in \mathbb{R}^{cd}$ is defined by
stacking $B$ as a vector, columnwise.



2 Main Result 

As described in the previous section, this paper is dedicated to better understanding the good
behavior of $\ell^1$ minimization for dictionary learning. In particular, we would like to
assert that under natural, easily-satisfied conditions, the desired solution can be recovered,
at least locally. Of course, whether this is true will depend strongly on the properties of the
dictionary $A$ to be recovered, as well as the sparse coefficients $X$ that generate our
observation $Y = AX$. In this paper, we restrict our attention to dictionaries $A$ whose columns
have unit $\ell^2$ norm. We will adopt the simple assumption that the columns of $A$ are
well-spread in the observation space $\mathbb{R}^m$, i.e., the mutual coherence [DE03]

$\mu(A) = \max_{i \ne j} |\langle A_i, A_j \rangle|$ (2.1)

is small. Classical results [DE03, GN03, Fuc04] show that if $A$ is a (known) dictionary, then
$\ell^1$ minimization recovers any sparse representation with up to $1/(2\mu(A))$ nonzeros:

$\|x_0\|_0 < \frac{1}{2}(1 + 1/\mu(A)) \implies x_0 = \arg\min \|x\|_1$ subject to $Ax = Ax_0$. (2.2)

This result, while pessimistic compared to typical-case behavior [CP09], is powerful because its
assumptions on $A$ are reasonable; it does not seem particularly onerous to assume that
$\mu(A)$ will be small for learned dictionaries.
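To make the coherence condition concrete, the following NumPy sketch evaluates (2.1) and the right-hand side of (2.2) on a small example. The helper name `mutual_coherence` and the test dictionary (the identity joined with a normalized $4 \times 4$ Hadamard basis) are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def mutual_coherence(A):
    """Largest |<A_i, A_j>| over distinct columns i != j, as in (2.1).

    Columns are normalized first, so A need not have unit-norm columns.
    """
    A = A / np.linalg.norm(A, axis=0)
    G = np.abs(A.T @ A)          # absolute Gram matrix
    np.fill_diagonal(G, 0.0)     # ignore the trivial i == j terms
    return G.max()

# An orthonormal basis has coherence 0.
I4 = np.eye(4)
# Normalized Hadamard basis: orthonormal columns with entries +-1/2.
H4 = np.array([[1,  1,  1,  1],
               [1, -1,  1, -1],
               [1,  1, -1, -1],
               [1, -1, -1,  1]]) / 2.0

mu_orth = mutual_coherence(I4)
mu_pair = mutual_coherence(np.hstack([I4, H4]))   # overcomplete 4 x 8 dictionary
bound = 0.5 * (1.0 + 1.0 / mu_pair)               # right-hand side of (2.2)
print(mu_orth, mu_pair, bound)
```

For this tiny dictionary the coherence is $1/2$, so (2.2) only covers $\|x_0\|_0 < 1.5$, i.e. 1-sparse vectors; the bound becomes useful as $m$ grows and $\mu(A)$ shrinks.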

The next question is how to model the sparse coefficients $X$. In analogy to results in sparse
representation, we would like to assert that dictionary learning algorithms function correctly
when their assumptions are met, i.e., when the coefficients $X$ are sufficiently sparse.
However, it is also clear that by itself sparsity of $X$ is not sufficient for $(A, X)$ to be a
local minimum. As a very simple example, imagine that there is some $i$ for which all of the
$X_{ij}$ are zero. In this paper, we assume that the sparsity pattern of $X$ is random, and that
the values of the nonzero entries are Gaussian.

More precisely, we assume that each of the columns $x_1, \dots, x_p$ of
$X \in \mathbb{R}^{n \times p}$ is generated iid by first choosing $k$ out of its $n$ entries
uniformly at random to be nonzero, and letting the values of these nonzero entries be
independent Gaussians with zero mean and common standard deviation $\sigma$. The choice of a
Gaussian model is one of mathematical convenience; the results in this paper are easily
generalized to wider classes of symmetric distributions. However, the assumptions of zero mean
and common variance are more essential to our analysis. We can state the above model more
formally as follows. We assume that the observations
$Y = [y_1, \dots, y_p] \in \mathbb{R}^{m \times p}$ are generated iid, $y_j = A x_j$, where
$x_j \in \mathbb{R}^n$ satisfies a Gaussian-random-sparsity model:

$\Omega_j \sim \mathrm{uni}\binom{[n]}{k}$ (2.3)

and

$x_j = P_{\Omega_j} v_j$, (2.4)

where

$v_{ij} \sim_{\mathrm{iid}} \mathcal{N}(0, \sigma^2)$, $\sigma = \sqrt{n/(kp)}$. (2.5)

That is, $X = \mathcal{P}_\Omega[V]$, where $\Omega = \{(i, j) \mid j \in [p], i \in \Omega_j\}$
is the overall support set. The advantage to writing $X$ in this manner is that it makes
independence of $\Omega$ and $V$ clear. The scaling on $v_{ij}$ plays no essential role in our
proof; the normalization in (2.5) is simply notationally convenient because it implies that the
spectral norm, $\|X\|$, is approximately one when $p$ is large.
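The model (2.3)-(2.5) can be sampled directly. The sketch below is an illustration under assumed parameters: the function name `sample_coefficients` and the sizes `n`, `p`, `k` are hypothetical. It draws $X$ column by column and reports the spectral norm, which the normalization in (2.5) is intended to keep near one for large $p$.

```python
import numpy as np

def sample_coefficients(n, p, k, rng):
    """Draw X in R^{n x p} following (2.3)-(2.5)."""
    sigma = np.sqrt(n / (k * p))                          # scaling from (2.5)
    X = np.zeros((n, p))
    for j in range(p):
        support = rng.choice(n, size=k, replace=False)    # Omega_j: uniform k-subset of [n]
        X[support, j] = sigma * rng.standard_normal(k)    # v_ij ~ N(0, sigma^2) on Omega_j
    return X

rng = np.random.default_rng(0)
n, p, k = 20, 5000, 3   # hypothetical sizes, chosen only for illustration
X = sample_coefficients(n, p, k, rng)

nonzeros_per_column = np.count_nonzero(X, axis=0)
spectral_norm = np.linalg.norm(X, 2)
print(nonzeros_per_column.min(), nonzeros_per_column.max(), spectral_norm)
```

Under this scaling $\mathbb{E}[XX^*] = p \cdot (k/n) \cdot \sigma^2 I = I$, which is why the printed spectral norm should be close to one when $p$ is large relative to $n$.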

In dictionary learning, we do not observe $A$ or $X$, but rather their product
$Y = AX \in \mathbb{R}^{m \times p}$. Corresponding to this observation $Y$, there exists a
manifold of possible factorizations

$\mathcal{M} = \{(A, X) \mid AX = Y, \|A_i\|_2 = 1 \;\forall i\} \subset \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times p}$. (2.6)

In this notation, our model approach (1.2) can be viewed as a nonsmooth optimization over this
smooth submanifold:

minimize $f(x)$ subject to $x \in \mathcal{M}$, (2.7)

where $f(x) = \|X\|_1$ for $x = (A, X)$. Our main result states that if $x = (A, X)$ satisfies
the above assumptions, then provided the number of samples is large enough, with high
probability $x$ will be a local minimum of $f$. More precisely:



^Conversely, there is less a priori reason to believe that dictionaries encountered from sample
data will satisfy more powerful assumptions such as the RIP. The absence of RIP in $A$ should
not, however, be confused with the absence of RIP in the local analysis of dictionary learning,
which as we will see arises not from the properties of $A$ per se, but rather from the structure
of the tangent space to the constraint manifold.

Theorem 2.1. There exist numerical constants $C_1, C_2, C_3 > 0$ such that the following occurs.
If $x = (A, X)$ satisfies the probability model (2.3)-(2.5) with

$k < \min\{C_1/\mu(A), C_2 n\}$, (2.8)

then $x$ is a local minimum of the $\ell^1$ norm over $\mathcal{M}$, with probability at least

$1 - C_3 \|A\|^2 n^{3/2} k^{1/2} p^{-1/2} (\log p)$. (2.9)

This result implies that from polynomially many samples (say $p = \omega(n^3 k)$), the
dictionary learning problem becomes locally well-posed, i.e., the desired solution becomes a
local minimum of the $\ell^1$-norm. One can see that the sparsity demanded by Theorem 2.1 mimics
that of (2.2). Indeed, this result implies that under essentially the same conditions as the
classical bound for sparse recovery (2.2), one can (locally) recover all of the sparse
coefficients $X$, as well as the sparsifying basis $A$.



2.1 Comparison to existing results 

As mentioned in the introduction, the theory of dictionary learning is only beginning to
develop. The most direct point of comparison for our result is the very nice paper of Gribonval
and Schnass [GS10] (henceforth "G-S"). That work proposed to study the optimization (1.2), and
developed conditions for a given solution $x = (A, X)$ to be a local minimum. These conditions
essentially demand that $x$ be optimal over the tangent space to the constraint manifold at $x$.
While we do not directly use the optimality conditions of G-S, the duality condition that we
base our approach on is essentially equivalent. However, the subsequent analysis uses a
completely different set of tools and approaches.

Aside from developing optimality conditions, the major contribution of [GS10] is a probabilistic
analysis of the case when $A \in \mathbb{R}^{n \times n}$ is square and the coefficients $X$ are
iid Bernoulli-Gaussian, i.e., each $X_{ij}$ is nonzero with probability $\rho$, and the nonzero
entries are conditionally Gaussian. Using arguments from geometry and concentration of measure,
G-S show that in this situation $(A, X)$ is a local optimum with high probability provided
$p = \Omega(n \log n / \rho)$.

Our Theorem 2.1 is more general, since it encompasses cases where $A$ is nonsquare (i.e., an
overcomplete dictionary). However, the number of samples stipulated by our bound is larger.
Indeed, if we take $k = O(1)$, and set $\rho = k/n$ for purposes of comparison, then for square
matrices, G-S's result guarantees correct recovery from $n^2 \log n$ samples. Our result
requires at least $n^3$ samples, but applies to general matrices. It is possible that the gap
between the two orders of growth might be further closed with a more refined analysis of the
construction proposed in this paper.

2.2 Discussion 

While we find these results quite encouraging, there is still much to do. In fact, there remains
a wealth of fascinating open problems just involving the linearized subproblem. One natural
question is whether the assumption of hard sparsity in $X$ can be relaxed to a
Bernoulli-Gaussian model, with similar probability of each coefficient being nonzero; i.e.,
$\rho \approx k/n$. In this case, care will need to be taken because a small number of columns
of $X$ may be so dense as to not be optimal. However, we see no essential obstacle to extending
the approach used here to deal with this case. Another, more difficult question, is what will
happen if the number of nonzero entries dramatically exceeds $C_1/\mu(A)$. In this case, again,
many of the individual columns of $X$ may be suboptimal, but it is still likely that the basis
$A$ is a local minimum. We believe that the golfing scheme of Section 4 will again provide a
relevant tool. However, more work will need to be done to ensure that the balancedness condition
in Theorem 5.1 still holds. Even more interesting from an application perspective would be to
show that noise does not significantly affect the local optimality of the desired solution
$x_\star$. The framework of Negahban and collaborators may be relevant here [NRWY10].



3 Local Properties and the Linearized Subproblem



As we saw in the previous section, our main result concerns the local optimality of the desired
solution $x = (A, X)$ over the smooth submanifold
$\mathcal{M} \subset \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times p}$. A key role in this
result will be played by the tangent space $T_x\mathcal{M}$ to $\mathcal{M}$ at $x$, which can
be identified with the space of all perturbations
$(\Delta_A, \Delta_X) \in \mathbb{R}^{m \times n} \times \mathbb{R}^{n \times p}$ satisfying

$\Delta_A X + A \Delta_X = 0$, $\langle A_i, \Delta_{A,i} \rangle = 0 \;\forall i$. (3.1)

The first equation comes from differentiating the bilinear constraint $Y = AX$, while the second
comes from differentiating the constraint $\|A_i\|_2 = 1$.

Intuitively, we might hope to study the local properties of $f$ by studying how it behaves on
the tangent space at $x$. Replacing $\mathcal{M}$ with its linearization about $x$ yields the
following optimization problem:

minimize $f(x + \delta)$ subject to $\delta \in T_x\mathcal{M}$. (3.2)

Using the above characterization of $T_x\mathcal{M}$, this can be written a bit more concretely
as

minimize $\|X + \Delta_X\|_1$ subject to $\Delta_A X + A \Delta_X = 0$, $\langle A_i, \Delta_{A,i} \rangle = 0 \;\forall i$. (3.3)

This linearized subproblem is convex. In particular, it is easy to see that under an appropriate
change of variables, it is equivalent to an equality constrained $\ell^1$ minimization problem,

minimize $\|z\|_1$ subject to $Bz = Bz_0$. (3.4)

This should give us reason for optimism: as alluded to in the introduction, a great deal of
effort has gone into developing technical tools for understanding the solutions to
$\ell^1$-minimization problems. The following lemma tells us that in order to determine if $x$
is a local minimum, it is enough to ask whether $\delta = 0$ is the unique optimal solution to
the linearized subproblem (3.2).

Lemma 3.1. Suppose that $x \in \mathcal{M}$ is such that $\delta = 0$ is the unique optimal
solution to (3.2). Then $x$ is a local minimum of the function $f(\cdot)$ over $\mathcal{M}$.
Conversely, if $x$ is a local minimum, then $\delta = 0$ is an optimal solution to (3.2).

Proof. Please see Appendix D. □



We will prove our main result, Theorem 2.1, by showing that under the stated conditions the zero
perturbation $(\Delta_A, \Delta_X) = (0, 0)$ is indeed the unique optimal solution to (3.3). To
do so, we need to study an equality constrained $\ell^1$-minimization problem of the same form
as (3.4). In the absence of specific assumptions on the distribution of $B$ (such as Gaussianity
[DT09]), the dominant tool for doing this is the Restricted Isometry Property (RIP), which holds
with order $k$ and constant $0 < \delta < 1$ if

$(1 - \delta)\|z\|^2 \le \|Bz\|^2 \le (1 + \delta)\|z\|^2$ for all $z$ such that $\|z\|_0 \le k$. (3.5)

When the RIP holds (with appropriate $k, \delta$), the $\ell^1$-minimization (3.4) recovers any
sufficiently sparse $z_0$, and noise-aware versions perform stably [Can08]. Thus, if we could
show that the equality constraints in (3.3) satisfy an appropriate RIP variant, we would be
done.

^In this paper, our space of optimization $\mathcal{M}$ is most naturally specified as a
submanifold of $\mathbb{R}^N$ (here $N = mn + np$). We will commit sins of notation such as
identifying the tangent space $T_x\mathcal{M}$ with a particular vector subspace of
$\mathbb{R}^N$, and occasionally writing $x + \delta \in \mathbb{R}^N$, where
$x \in \mathcal{M}$ and $\delta \in T_x\mathcal{M}$. Given the relatively small role played by
the intrinsic Riemannian structure of $\mathcal{M}$ in the paper, we believe these
simplifications are justified.

Unfortunately, this is not the case: the RIP fails for our problem of interest. We sketch why
this is true. At a high level, the RIP states that the operator $B$ respects the geometry of all
sparse vectors; in particular, there are no sparse vectors near the nullspace of $B$. In our
case, $B$ is specified by the equality constraints in (3.3). Take any permutation matrix
$\Pi \in \mathbb{R}^{n \times n}$ with no fixed point, and set

$\Delta_A = -A\Pi$, $\Delta_X = \Pi X$. (3.6)

Then, it is easy to see that $\Delta_A X + A \Delta_X = 0$. Moreover, for each $i$,

$\langle A_i, \Delta_{A,i} \rangle = -\langle A_i, A_{\pi(i)} \rangle \approx 0$, (3.7)

which follows because $\pi(i) \ne i$ and $A$ has incoherent columns. Thus, we have constructed a
perturbation $(\Delta_A, \Delta_X)$ that lies very near the nullspace of $B$, and such that
$\Delta_X$ has exactly the same sparsity as the desired solution $X$. In fact, not only does the
RIP not hold, but structured variants (for example, restricting to matrices $\Delta_X$ with
sparse columns, rather than general sparse matrices) also fail. We make this intuitive argument
more precise in the Appendix.
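The perturbation in (3.6) can be examined numerically. The sketch below is an illustration only: the sizes, the seed, and the choice of a cyclic shift as the fixed-point-free permutation are assumptions. It checks that the bilinear constraint of (3.3) holds exactly, that the remaining constraint is violated by at most $\mu(A)$ per column, and that $\Delta_X$ is as sparse as $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, p, k = 64, 96, 50, 4   # hypothetical sizes, chosen only for illustration

# Random unit-norm dictionary and its mutual coherence (2.1).
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)
G = np.abs(A.T @ A)
np.fill_diagonal(G, 0.0)
mu = G.max()

# Coefficients with exactly k nonzeros per column.
X = np.zeros((n, p))
for j in range(p):
    X[rng.choice(n, size=k, replace=False), j] = rng.standard_normal(k)

# Cyclic shift: a permutation matrix with no fixed point.
Pi = np.roll(np.eye(n), 1, axis=0)

Delta_A = -A @ Pi   # (3.6)
Delta_X = Pi @ X

# First constraint of (3.3) holds up to rounding error.
bilinear_residual = np.linalg.norm(Delta_A @ X + A @ Delta_X)
# Second constraint: |<A_i, Delta_A,i>| = |<A_i, A_pi(i)>| is at most mu(A), as in (3.7).
column_inner_products = np.abs(np.sum(A * Delta_A, axis=0))
# Delta_X is a row permutation of X, so column sparsity is unchanged.
same_sparsity = np.array_equal(np.count_nonzero(Delta_X, axis=0),
                               np.count_nonzero(X, axis=0))
print(bilinear_residual, column_inner_products.max(), mu, same_sparsity)
```

The inner products in (3.7) are small only to the extent that $\mu(A)$ is small, which is the sense in which the perturbation lies "very near" the nullspace rather than in it.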

This leaves us in a situation with less in common with compressed sensing, and much more in
common with the difficult problem of matrix completion [CR08]. In that problem, we are shown a
small set of entries $Q_{i_1 j_1}, \dots, Q_{i_p j_p}$ of an unknown low-rank matrix $Q$. The
goal is to fill in the missing values. There, the natural analogue of the RIP also fails, since
the sampling operator completely misses some low-rank matrices (for example, those consisting of
a single nonzero entry that is not in the observed set). This fact significantly complicates
analysis [CT09]. Motivated by applications in quantum information theory, recent papers by Gross
and collaborators have introduced a number of technical tools that significantly ease the
analysis of matrix completion [Gro09], allowing them to derive near-optimal recovery guarantees
in a clear and simple manner. Moreover, the ideas of [Gro09] appear to be useful in a variety of
settings beyond matrix completion [CLMW09, CP10]. In this paper, we use similar proof techniques
to analyze the linearized subproblem (3.3). While the details necessarily differ quite a bit
from Gross's work, our inspiration is very much the success of these tools in other non-RIP
settings.

To describe this scheme in more detail, however, it is easiest to start at the very beginning.
We wish to establish that $(0, 0)$ is optimal for (3.3). To do so, we recall the KKT conditions
for this problem, which imply that $(0, 0)$ is optimal if and only if there exist two dual
variables, a matrix $\Lambda \in \mathbb{R}^{m \times p}$ (corresponding to the constraint
$\Delta_A X + A \Delta_X = 0$) and a diagonal matrix $\Gamma \in \mathbb{R}^{n \times n}$
(corresponding to the constraint $\langle A_i, \Delta_{A,i} \rangle = 0$) satisfying

$A^* \Lambda \in \partial\|\cdot\|_1(X)$, (3.8)
$\Lambda X^* = A\Gamma$. (3.9)

The interested reader can easily derive these conditions; we will provide a rigorous proof of a
more useful variant below in Lemma 3.2. The first constraint simply asserts that each column
$x_j$ of $X$ is the minimum $\ell^1$ norm solution to $Ax = y_j$. Indeed, writing
$\Omega = \mathrm{support}(X)$ and $\Sigma = \mathrm{sign}(X)$, we recall that

$\partial\|\cdot\|_1(X) = \{\Sigma + W \mid \mathcal{P}_\Omega[W] = 0, \|W\|_\infty \le 1\}$. (3.10)

Then, writing $\lambda_j$ for the $j$-th column of $\Lambda$, (3.8) holds if and only if
$\exists\, w_1, \dots, w_p \in \mathbb{R}^n$ such that

$A^* \lambda_j = \mathrm{sign}(x_j) + w_j$, $P_{\Omega_j} w_j = 0$, $\|w_j\|_\infty \le 1$. (3.11)

This constraint is quite familiar from $\ell^1$-minimization: duality, and in particular the
construction of dual certificates $\lambda_j$, plays a crucial role in a number of works on the
correctness of $\ell^1$ minimization [Fuc04, CT05, WM10].



On the other hand, the second constraint (3.9) is less familiar. It essentially asserts that
locally we cannot improve our situation by changing the basis $A$. Notice that it demands that
each column of $\Lambda X^*$ is proportional to the corresponding column of $A$; we find it
convenient to introduce an operator
$\Phi : \mathbb{R}^{m \times n} \to \mathbb{R}^{m \times n}$ that projects each column onto the
orthogonal complement of the corresponding column of $A$:

$\Phi[M] = [(I - A_1 A_1^*) M_1 \mid \cdots \mid (I - A_n A_n^*) M_n]$, (3.12)

giving an equivalent constraint $\Phi[\Lambda X^*] = 0$. This constraint still places demands on
all of the dual vectors $\lambda_j$ simultaneously, making it potentially more difficult to
satisfy than (3.8).
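The operator in (3.12) has a short vectorized form. The following sketch assumes unit-norm columns of $A$; the function name `phi` and the random test matrices are illustrative and not part of the paper. It checks numerically that $\Phi$ annihilates any matrix of the form $A\Gamma$ with $\Gamma$ diagonal, so that (3.9) implies $\Phi[\Lambda X^*] = 0$, and that $\Phi$ is self-adjoint and idempotent, as expected of a projection.

```python
import numpy as np

def phi(A, M):
    """Apply (3.12): project column i of M onto the orthogonal complement of A_i.

    Assumes every column of A has unit l2 norm.
    """
    coeffs = np.sum(A * M, axis=0)   # <A_i, M_i> for each column i
    return M - A * coeffs            # M_i - A_i <A_i, M_i>, column by column

rng = np.random.default_rng(2)
m, n = 5, 7                          # hypothetical sizes
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)
M = rng.standard_normal((m, n))
N = rng.standard_normal((m, n))
Gamma = np.diag(rng.standard_normal(n))

vanishes_on_A_Gamma = np.allclose(phi(A, A @ Gamma), 0.0)
self_adjoint = np.isclose(np.sum(phi(A, M) * N), np.sum(M * phi(A, N)))
idempotent = np.allclose(phi(A, phi(A, M)), phi(A, M))
print(vanishes_on_A_Gamma, self_adjoint, idempotent)
```

Working with column-wise inner products avoids forming the $n$ separate $m \times m$ projection matrices $I - A_i A_i^*$.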



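In the following lemma, we trade off between the two constraints, showing that if we tighten our
demands on (3.8), we can correspondingly loosen the demand on (3.9).

Lemma 3.2. Let $A$ be a matrix with no $k$-sparse vectors in its nullspace. Suppose that there
exists $\alpha > 0$ such that for all pairs $(\Delta_A, \Delta_X)$ satisfying (3.1),

$\|\mathcal{P}_{\Omega^c} \Delta_X\|_F \ge \alpha \|\Delta_A\|_F$. (3.13)

Then if there exists $\Lambda \in \mathbb{R}^{m \times p}$ such that

$\mathcal{P}_\Omega[A^* \Lambda] = \Sigma$, $\|\mathcal{P}_{\Omega^c}[A^* \Lambda]\|_\infty < 1/2$, (3.14)

and

$\|\Phi[\Lambda X^*]\|_F < \alpha/2$, (3.15)

we conclude that $(\Delta_A, \Delta_X) = (0, 0)$ is the unique optimal solution to (3.3).

Proof. Consider any feasible $(\Delta_A, \Delta_X)$. Choose $H \in \partial\|\cdot\|_1(X)$ such
that $\langle H, \mathcal{P}_{\Omega^c} \Delta_X \rangle = \|\mathcal{P}_{\Omega^c} \Delta_X\|_1$,
and notice that $\mathcal{P}_\Omega H = \Sigma$. Then

$\|X + \Delta_X\|_1 \ge \|X\|_1 + \langle H, \Delta_X \rangle$. (3.16)

Notice that since $(\Delta_A, \Delta_X)$ is feasible,

$\langle \Delta_A, \Lambda X^* \rangle + \langle \Delta_X, A^* \Lambda \rangle = \langle \Delta_A X, \Lambda \rangle + \langle A \Delta_X, \Lambda \rangle = \langle \Delta_A X + A \Delta_X, \Lambda \rangle = \langle 0, \Lambda \rangle = 0$.

Hence,

$\|X + \Delta_X\|_1 \ge \|X\|_1 + \langle H, \Delta_X \rangle - \langle A^* \Lambda, \Delta_X \rangle - \langle \Lambda X^*, \Delta_A \rangle$ (3.17)
$= \|X\|_1 + \langle H - A^* \Lambda, \Delta_X \rangle - \langle \Lambda X^*, \Delta_A \rangle$ (3.18)
$= \|X\|_1 + \langle \mathcal{P}_\Omega[H - A^* \Lambda], \mathcal{P}_\Omega \Delta_X \rangle + \langle \mathcal{P}_{\Omega^c}[H - A^* \Lambda], \mathcal{P}_{\Omega^c} \Delta_X \rangle - \langle \Phi[\Lambda X^*], \Delta_A \rangle$ (3.19)
$= \|X\|_1 + \langle \mathcal{P}_{\Omega^c}[H - A^* \Lambda], \mathcal{P}_{\Omega^c} \Delta_X \rangle - \langle \Phi[\Lambda X^*], \Delta_A \rangle$ (3.20)
$\ge \|X\|_1 + \|\mathcal{P}_{\Omega^c} \Delta_X\|_1 / 2 - \|\Delta_A\|_F \|\Phi[\Lambda X^*]\|_F$ (3.21)
$\ge \|X\|_1 + \left(\tfrac{1}{2} - \alpha^{-1} \|\Phi[\Lambda X^*]\|_F\right) \|\mathcal{P}_{\Omega^c} \Delta_X\|_1$. (3.22)

In (3.19), we have used the fact that since $\Delta_A$ is feasible, each column of $\Delta_A$ is
orthogonal to the corresponding column of $A$, and so $\Phi[\Delta_A] = \Delta_A$. Furthermore,
it is easily verified that $\Phi$ is self-adjoint, and so
$\langle \Lambda X^*, \Phi[\Delta_A] \rangle = \langle \Phi[\Lambda X^*], \Delta_A \rangle$. In
(3.20), we have used that since $H \in \partial\|\cdot\|_1(X)$,
$\mathcal{P}_\Omega H = \Sigma = \mathcal{P}_\Omega[A^* \Lambda]$.

The right hand side of (3.22) is strictly greater than $\|X\|_1$ provided that (i)
$\|\Phi[\Lambda X^*]\|_F < \alpha/2$ and (ii) $\mathcal{P}_{\Omega^c} \Delta_X \ne 0$. The
assumption (3.13) and our assumption on the nullspace of $A$ imply (ii) whenever
$(\Delta_A, \Delta_X) \ne (0, 0)$. □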

The remainder of the argument will show that the hypotheses of this lemma indeed hold. In 
Section|4] we give a construction of a dual matrix A that always satisfies (3.14), and satisfies (3.15) 
with high probability, provided p is large enough. This is the content of Theorem 4.1 In Section 
[5] we show that with high probability the balancedness property (3.13) indeed holds with nonzero 

Combining these two 



5.1 



a (in particular, we can take a = C/|jA||^). This is done in Theorem 
results with Lemma |3.2| completes the proof of Theorem |2.1| The proofs of these key lemmas make 
repeated use of bounds on singular values of submatrices of an incoherent matrix. For completeness, 
we assemble these required results in Appendix [Cl 
4 Certification Process



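In this section, we show how to construct the dual certificate demanded by Lemma 3.2. In particular, we prove the following result:

Theorem 4.1. There exist numerical constants $C_1, C_2, C_3 > 0$ such that the following occurs. Whenever

$$k \le \min\{C_1 n,\ C_2/\mu(A)\}, \tag{4.1}$$

then for any $\alpha > 0$, there exists $\Lambda \in \mathbb R^{m\times p}$ simultaneously satisfying the following three properties:

\begin{align}
\mathcal P_\Omega[A^*\Lambda] &= \Sigma, \tag{4.2}\\
\|\mathcal P_{\Omega^c}[A^*\Lambda]\|_\infty &\le 1/2, \tag{4.3}\\
\|\Phi[\Lambda X^*]\|_F &< \alpha/2, \tag{4.4}
\end{align}

with probability at least

$$1 - C_3\,\alpha^{-1}(\log p)\,n^{3/2}k^{1/2}p^{-1/2}. \tag{4.5}$$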



4.1 First Steps 

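The following lemma goes most of the way to establishing Theorem 4.1.

Lemma 4.2. Fix any $p > 0$ and let $x_1, \dots, x_p$ be independent and identically distributed random vectors with $x_j = P_{\Omega_j} v_j$, where the $\Omega_j \subset [n]$ are uniform random subsets of size $k$ and $v_j \sim_{\mathrm{iid}} \mathcal N(0, n/kp)$. Then there exists a positive integer $t_1 \in [\lfloor (p-1)/2\rfloor, p]$ and a sequence of random vectors $\lambda_1, \dots, \lambda_{t_1}$ depending only on $x_1, \dots, x_{t_1}$ such that

\begin{align}
P_{\Omega_j} A^*\lambda_j &= \operatorname{sign}(x_j), \tag{4.6}\\
\|P_{\Omega_j^c} A^*\lambda_j\|_\infty &\le 1/2, \qquad j = 1, \dots, t_1, \tag{4.7}\\
\mathbb E\Big[\Big\|\Phi\Big[\sum_{j=1}^{t_1} \lambda_j x_j^*\Big]\Big\|_F\Big] &\le C\, n^{3/2} k^{1/2} p^{-1/2}, \tag{4.8}
\end{align}

where $C$ is numerical.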



Section 4.2 proves Lemma 4.2 by giving an explicit construction of the desired dual certificates. Before describing this construction in greater detail, we first show that Theorem 4.1 follows as an easy consequence of Lemma 4.2, by dividing the sample set into subsets and then applying Lemma 4.2 to each subset.



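Proof of Theorem 4.1. Choose $t_1$ according to Lemma 4.2 and let $\lambda_1, \dots, \lambda_{t_1}$ be the corresponding (random) dual vectors indicated by Lemma 4.2. Then

$$\mathbb E\Big[\Big\|\Phi\Big[\sum_{j=1}^{t_1} \lambda_j x_j^*\Big]\Big\|_F\Big] \le C\, n^{3/2} k^{1/2} p^{-1/2}. \tag{4.9}$$

Moreover, unless $p < 3$, $p - t_1 \le 3p/4$. Notice that the iid random vectors

$$\Big(\frac{p}{p - t_1}\Big)^{1/2} x_{t_1+1},\ \dots,\ \Big(\frac{p}{p - t_1}\Big)^{1/2} x_p \tag{4.10}$$

again satisfy the hypotheses of Lemma 4.2, with $p - t_1$ in place of $p$. Hence, there exists $s \in [\lfloor (p - t_1 - 1)/2\rfloor, p - t_1]$ and corresponding certificates $\lambda_{t_1+1}, \dots, \lambda_{t_1+s}$, again satisfying (4.6)-(4.7), such that if we set $t_2 = t_1 + s$, we have

$$\mathbb E\Big[\Big\|\Phi\Big[\sum_{j=t_1+1}^{t_2} \lambda_j x_j^*\Big]\Big\|_F\Big] \le \Big(\frac{p - t_1}{p}\Big)^{1/2} C\, n^{3/2} k^{1/2} (p - t_1)^{-1/2} = C\, n^{3/2} k^{1/2} p^{-1/2}. \tag{4.11}$$

This leaves at most $p - t_2 \le \max((3/4)^2 p, 2)$ vectors $x_{t_2+1}, \dots, x_p$ to be certified. Repeating this construction $O(\log p)$ times yields a sequence of dual certificates $\lambda_1, \dots, \lambda_p$ satisfying (4.6)-(4.7), with

$$\mathbb E\Big[\Big\|\Phi\Big[\sum_{j=1}^{p} \lambda_j x_j^*\Big]\Big\|_F\Big] \le C' (\log p)\, n^{3/2} k^{1/2} p^{-1/2}.$$

The desired probability estimate follows from the Markov inequality. □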



4.2 The construction 

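In this section, we outline a process that constructs the random sequence of certificates $\lambda_1, \dots, \lambda_{t_1}$ described in Lemma 4.2. We will describe a construction of $p$ certificates $\lambda_1, \dots, \lambda_p$, and then choose $t_1 \in [\lfloor (p-1)/2\rfloor, p]$ according to our analysis of this construction. Recall that we have defined $\Omega_j = \operatorname{support}(x_j) \subset [n]$. Below, we will use $\Theta_j \in \mathbb R^{m\times m}$ to denote the orthoprojector onto the orthogonal complement of the range of $A_{\Omega_j}$:

$$\Theta_j = I - A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} A_{\Omega_j}^*. \tag{4.12}$$

We will let $Q_j$ denote the residual at time $j$:

$$Q_j = \Phi\Big[\sum_{i=1}^{j} \lambda_i x_i^*\Big]. \tag{4.13}$$

As above, let $\sigma_j = \operatorname{sign}(x_j(\Omega_j)) \in \{\pm 1\}^k$. Set

\begin{align}
\lambda_j^{LS} &= A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1}\sigma_j, \tag{4.14}\\
\zeta_j &= \begin{cases} -\dfrac14\,\dfrac{\Theta_j Q_{j-1} x_j}{\|\Theta_j Q_{j-1} x_j\|_2}, & \Theta_j Q_{j-1} x_j \ne 0,\\[2mm] 0, & \text{else}, \end{cases} \tag{4.15}\\
\lambda_j &= \lambda_j^{LS} + \zeta_j. \tag{4.16}
\end{align}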

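While it appears complicated, the rationale for the above procedure is actually quite simple. At each step $j$, we construct a certificate $\lambda_j \in \mathbb R^m$. We would like to make $Q_j = \Phi[\sum_{i=1}^{j} \lambda_i x_i^*]$ as small as possible, while still respecting the constraints $A_{\Omega_j}^*\lambda_j = \sigma_j$ and $\|A_{\Omega_j^c}^*\lambda_j\|_\infty \le 1/2$. The first term, $\lambda_j^{LS}$, serves to ensure that the certification constraints are met. Notice that since $A_{\Omega_j}^*\Theta_j = 0$,

$$A_{\Omega_j}^*\lambda_j = A_{\Omega_j}^*\lambda_j^{LS} = \sigma_j. \tag{4.17}$$

Moreover, for each $i \notin \Omega_j$,

$$|A_i^*\lambda_j^{LS}| = |A_i^* A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1}\sigma_j| \le \|A_i^* A_{\Omega_j}\|_2\,\|(A_{\Omega_j}^* A_{\Omega_j})^{-1}\|\,\|\sigma_j\|_2. \tag{4.18}$$

Since $\sigma_j \in \{\pm 1\}^k$, $\|\sigma_j\|_2 = \sqrt k$. Under the assumption $k\mu(A) < 1/2$, a standard argument (repeated as (C.4) of Appendix C) shows that $\|(A_{\Omega_j}^* A_{\Omega_j})^{-1}\| \le 2$. Finally, since $A_i^* A_{\Omega_j}$ is a vector of length $k$ with entries bounded by $\mu(A)$, $\|A_i^* A_{\Omega_j}\|_2 \le \mu(A)\sqrt k$. Combining bounds, we have

$$\|A_{\Omega_j^c}^*\lambda_j^{LS}\|_\infty = \max_{i \notin \Omega_j} |A_i^*\lambda_j^{LS}| \le 2k\mu(A). \tag{4.19}$$

Hence, further assuming $k\mu(A) \le 1/8$, we obtain $\|A_{\Omega_j^c}^*\lambda_j^{LS}\|_\infty \le 1/4$ and

$$\|A_{\Omega_j^c}^*\lambda_j\|_\infty \le \|A_{\Omega_j^c}^*\lambda_j^{LS}\|_\infty + \|A_{\Omega_j^c}^*\zeta_j\|_\infty \le \frac14 + \|\zeta_j\|_2 \le \frac12. \tag{4.20}$$

The choice of $\zeta_j$ is designed to deflate the residual $Q$ as much as possible.² As we will see in the proof of Lemma 4.2, this process does succeed in controlling the norm of the residual $Q_j$.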
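The least-squares part of the construction is easy to check numerically. The sketch below is illustrative only: it assumes a specific incoherent dictionary (the identity concatenated with a normalized Sylvester Hadamard matrix, for which $\mu(A) = 1/\sqrt m$), and all function names are hypothetical. It builds $\lambda^{LS} = A_\Omega(A_\Omega^*A_\Omega)^{-1}\sigma$ and the orthoprojector $\Theta$ for one random support, then reports the quantities appearing in (4.17) and (4.19).

```python
import numpy as np

def hadamard(m):
    """Sylvester Hadamard matrix of order m (m is assumed to be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H

def ls_certificate_check(m=64, k=3, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed dictionary: [I | H/sqrt(m)], unit-norm columns, coherence 1/sqrt(m).
    A = np.hstack([np.eye(m), hadamard(m) / np.sqrt(m)])
    n = A.shape[1]
    mu = np.max(np.abs(A.T @ A - np.eye(n)))
    omega = rng.choice(n, size=k, replace=False)      # random support of size k
    sigma = rng.choice([-1.0, 1.0], size=k)           # random sign pattern
    A_om = A[:, omega]
    gram = A_om.T @ A_om
    lam_ls = A_om @ np.linalg.solve(gram, sigma)      # least-squares certificate
    theta = np.eye(m) - A_om @ np.linalg.solve(gram, A_om.T)  # projector onto range(A_om)^perp
    off = np.setdiff1d(np.arange(n), omega)
    return {
        "k_mu": k * mu,
        "on_support_err": np.max(np.abs(A_om.T @ lam_ls - sigma)),
        "off_support_max": np.max(np.abs(A[:, off].T @ lam_ls)),
        "theta_err": np.max(np.abs(A_om.T @ theta)),
    }
```

With the default parameters, $k\mu(A) = 3/8 < 1/2$, so the off-support correlations should stay below $2k\mu(A)$ as in (4.19), while the on-support constraint holds exactly up to rounding.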

4.3 Analysis 

The next question is how to analyze the order of growth of $\|Q_j\|_F$, as a function of the matrix $A$ and the sparsity $k$. To be more formal, let $\mathcal F_j$ be the $\sigma$-algebra generated by $\Omega_1, \dots, \Omega_j$ and $v_1, \dots, v_j$;

$$\mathcal F_1 \subset \mathcal F_2 \subset \dots \subset \mathcal F_p$$

is the natural filtration. We will occasionally use the notation $\mathbb E_{\Omega_j}[f(\Omega, V)]$ to denote the expectation over $\Omega_j$, with all other variables fixed. More precisely,

$$\mathbb E_{\Omega_j}[f(\Omega, V)] = \mathbb E[f(\Omega, V) \mid \Omega_1, \dots, \Omega_{j-1}, \Omega_{j+1}, \dots, \Omega_p, V]. \tag{4.21}$$

We will use a similar notation $\mathbb E_{v_j}[f(\Omega, V)]$ for the expectation over $v_j$ with all other variables fixed. With these notations in mind, we embark on the proof of Lemma 4.2.



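Proof of Lemma 4.2. We begin by using $Q_j = Q_{j-1} + \Phi[\lambda_j x_j^*]$ to write

$$\mathbb E[\|Q_j\|_F^2 \mid \mathcal F_{j-1}] = \|Q_{j-1}\|_F^2 + 2\,\mathbb E[\langle Q_{j-1}, \Phi[\lambda_j x_j^*]\rangle \mid \mathcal F_{j-1}] + \mathbb E[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal F_{j-1}]. \tag{4.22}$$

We will show that there exist $\varepsilon(p) > 0$ and $\tau(p)$ such that

\begin{align}
\mathbb E[\langle Q_{j-1}, \Phi[\lambda_j x_j^*]\rangle \mid \mathcal F_{j-1}] &\le -\varepsilon(p) \times \|Q_{j-1}\|_F, \tag{4.23}\\
\mathbb E[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal F_{j-1}] &\le \tau(p). \tag{4.24}
\end{align}

Plugging in to (4.22) and taking the expectation of both sides yields

$$\mathbb E[\|Q_j\|_F^2] \le \mathbb E[\|Q_{j-1}\|_F^2] - 2\varepsilon(p)\,\mathbb E[\|Q_{j-1}\|_F] + \tau(p). \tag{4.25}$$

Summing from $j = 1, \dots, p$ and using that $Q_0 = 0$, we have

$$\mathbb E[\|Q_p\|_F^2] \le p\,\tau(p) - 2\varepsilon(p)\sum_{j=1}^{p-1} \mathbb E[\|Q_j\|_F]. \tag{4.26}$$

In paragraphs (i)-(iii) below, we show that the quantities $\varepsilon$ and $\tau$ satisfy the following bounds:

$$\varepsilon(p) \ge C_1\sqrt{k/np}, \quad\text{and}\quad \tau(p) \le C_2\, nk/p. \tag{4.27}$$

For now, taking these bounds as given, we observe that by (4.26)

$$\mathbb E[\|Q_1\|_F] \le (\mathbb E[\|Q_1\|_F^2])^{1/2} \le \sqrt{\tau(p)}, \tag{4.28}$$

² Indeed, $\zeta_j$ is a scaled version of a solution to the optimization problem

minimize $\|Q_{j-1} + \zeta x_j^*\|_F$ subject to $A_{\Omega_j}^*\zeta = 0$.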
and hence the claim of Lemma 4.2 is verified in the case $p = 1$. On the other hand, if $p > 1$, then using the fact that the left hand side of (4.26) is nonnegative, we have

$$\frac{1}{p-1}\sum_{j=1}^{p-1} \mathbb E[\|Q_j\|_F] \le \frac{p}{p-1}\,\frac{\tau(p)}{2\varepsilon(p)} \le \tau(p)/\varepsilon(p). \tag{4.29}$$

We recognize the left hand side of this inequality as an average. By the Markov inequality, if we were to choose an index $j \in \{1, \dots, p-1\}$ uniformly at random, then with probability at least $1/2$, $\mathbb E[\|Q_j\|_F] \le 2\tau(p)/\varepsilon(p)$. In particular, since $[\lfloor (p-1)/2\rfloor, p]$ contains more than half the elements of $\{1, \dots, p-1\}$, there exists at least one $t_1 \in [\lfloor (p-1)/2\rfloor, p]$ such that

$$\mathbb E[\|Q_{t_1}\|_F] \le 2\tau(p)/\varepsilon(p). \tag{4.30}$$

Plugging in the bounds from (4.27) establishes Lemma 4.2. All that remains to do is show that the bounds in (4.27) indeed hold. We establish the bound on $\varepsilon$ in paragraphs (i)-(ii) below, using the convenient split

$$\mathbb E[\langle Q_{j-1}, \lambda_j x_j^*\rangle \mid \mathcal F_{j-1}] = \mathbb E[\langle Q_{j-1}, \lambda_j^{LS} x_j^*\rangle \mid \mathcal F_{j-1}] + \mathbb E[\langle Q_{j-1}, \zeta_j x_j^*\rangle \mid \mathcal F_{j-1}]. \tag{4.31}$$

Finally, in paragraph (iii) below, we establish the bound on $\tau$ claimed in (4.27), completing the proof of Lemma 4.2.



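(i) Upper bounding $\langle Q_{j-1}, \lambda_j^{LS} x_j^*\rangle$. For $\Omega_j = \{a_1 < a_2 < \dots < a_k\}$, set $U_{\Omega_j} = [e_{a_1} \mid \dots \mid e_{a_k}] \in \mathbb R^{n\times k}$, so that $P_{\Omega_j} = U_{\Omega_j} U_{\Omega_j}^*$. Notice that we can write

$$\lambda_j^{LS} x_j^* = A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^* \operatorname{sgn}(v_j)\, v_j^* P_{\Omega_j}.$$

Then, using that $\mathbb E[\operatorname{sgn}(v_j) v_j^*] = c_1\sigma I$, we have

\begin{align}
\mathbb E[\langle Q_{j-1}, \lambda_j^{LS} x_j^*\rangle \mid \mathcal F_{j-1}] &= \mathbb E_{\Omega_j}\mathbb E_{v_j}\big\langle Q_{j-1},\ A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^* \operatorname{sgn}(v_j)\, v_j^* P_{\Omega_j}\big\rangle \notag\\
&= c_1\sigma\,\mathbb E_{\Omega_j}\big\langle Q_{j-1},\ A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^*\big\rangle. \tag{4.32}
\end{align}

Write $(A_{\Omega_j}^* A_{\Omega_j})^{-1} = I + \Delta(\Omega_j)$. Then, we have

\begin{align}
\big\langle Q_{j-1},\ A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^*\big\rangle &= \big\langle Q_{j-1}P_{\Omega_j},\ A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} U_{\Omega_j}^*\big\rangle \notag\\
&= \big\langle Q_{j-1}P_{\Omega_j},\ A_{\Omega_j} U_{\Omega_j}^*\big\rangle + \big\langle Q_{j-1}P_{\Omega_j},\ A_{\Omega_j}\Delta(\Omega_j) U_{\Omega_j}^*\big\rangle. \tag{4.33}
\end{align}

Since $Q_{j-1} = \Phi[\sum_{i=1}^{j-1} \lambda_i x_i^*] \in \operatorname{range}(\Phi)$, each of the columns of $Q_{j-1} \in \mathbb R^{m\times n}$ is orthogonal to the corresponding column of $A$. Since the first inner product in the above equation is simply the inner product of the restriction of $A$ to a subset of its columns and the restriction of $Q_{j-1}$ to the same subset of its columns, this term is zero. Applying the Cauchy-Schwarz inequality to the second term of the previous equation gives

$$\big\langle Q_{j-1}P_{\Omega_j},\ A_{\Omega_j}\Delta(\Omega_j) U_{\Omega_j}^*\big\rangle \le \|Q_{j-1}P_{\Omega_j}\|_F\,\|A_{\Omega_j}\|\,\|\Delta(\Omega_j)\|_F. \tag{4.34}$$

Standard calculations, given in (C.2) of Appendix C, show that $\|A_{\Omega_j}\| \le (1 + k\mu(A))^{1/2}$ is bounded by a constant, say, $c_2$. A similar calculation shows that $\|\Delta(\Omega_j)\| \le 2k\mu(A)$. Plugging back into (4.33), we have

$$\mathbb E[\langle Q_{j-1}, \lambda_j^{LS} x_j^*\rangle \mid \mathcal F_{j-1}] \le 2c_1c_2\,\sigma\, k\mu(A)\,\mathbb E_{\Omega_j}[\|Q_{j-1}P_{\Omega_j}\|_F]. \tag{4.35}$$

For now, we will be content with this expression.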
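(ii) Lower bounding $\langle Q_{j-1}, \zeta_j x_j^*\rangle$. Continuing, we have that

\begin{align}
\langle Q_{j-1}, \zeta_j x_j^*\rangle &= -\tfrac14\,\|\Theta_j Q_{j-1} x_j\|_2 \tag{4.36}\\
&= -\tfrac14\,\|(I - P_{A_{\Omega_j}}) Q_{j-1} P_{\Omega_j} v_j\|_2 \tag{4.37}\\
&\le -\tfrac14\,\big(\|Q_{j-1} P_{\Omega_j} v_j\|_2 - \|P_{A_{\Omega_j}} Q_{j-1} P_{\Omega_j} v_j\|_2\big) \tag{4.38}\\
&= -\tfrac14\,\|Q_{j-1} P_{\Omega_j} v_j\|_2 + \tfrac14\,\|P_{A_{\Omega_j}} Q_{j-1} P_{\Omega_j} v_j\|_2, \tag{4.39}
\end{align}

where $P_{A_{\Omega_j}} = A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1} A_{\Omega_j}^* \in \mathbb R^{m\times m}$. For the first term of (4.39), applying the Kahane-Khintchine inequality in Corollary B.3 gives

\begin{align}
\mathbb E[\|Q_{j-1} P_{\Omega_j} v_j\|_2 \mid \mathcal F_{j-1}] &= \mathbb E_{\Omega_j}\mathbb E_{v_j}[\|Q_{j-1} P_{\Omega_j} v_j\|_2] \tag{4.40}\\
&\ge \frac{\sigma}{\sqrt\pi}\,\mathbb E_{\Omega_j}[\|Q_{j-1} P_{\Omega_j}\|_F]. \tag{4.41}
\end{align}

For the second term of (4.39), writing $P_{A_{\Omega_j}} = A_{\Omega_j}(A_{\Omega_j}^* A_{\Omega_j})^{-1/2} \times (A_{\Omega_j}^* A_{\Omega_j})^{-1/2} A_{\Omega_j}^*$ as a product of matrices with spectral norm one, we have

\begin{align}
\|P_{A_{\Omega_j}} Q_{j-1} P_{\Omega_j} v_j\|_2 &\le \|(A_{\Omega_j}^* A_{\Omega_j})^{-1/2} A_{\Omega_j}^* Q_{j-1} P_{\Omega_j} v_j\|_2 \tag{4.42}\\
&\le \|(A_{\Omega_j}^* A_{\Omega_j})^{-1/2}\|\,\|A_{\Omega_j}^* Q_{j-1} P_{\Omega_j} v_j\|_2. \tag{4.43}
\end{align}

A calculation shows that under the assumption $k\mu(A) < 1/2$, $\|(A_{\Omega_j}^* A_{\Omega_j})^{-1/2}\| \le \sqrt2$, and so

$$\|P_{A_{\Omega_j}} Q_{j-1} P_{\Omega_j} v_j\|_2 \le \sqrt2\,\|A_{\Omega_j}^* Q_{j-1} P_{\Omega_j} v_j\|_2 = \sqrt2 \times \|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j} v_j\|_2. \tag{4.44}$$

Applying Jensen's inequality to bound the expectation of the above expression, we have (via Corollary B.3),

\begin{align}
\mathbb E[\|P_{A_{\Omega_j}} Q_{j-1} P_{\Omega_j} v_j\|_2 \mid \mathcal F_{j-1}] &\le \sqrt2 \times \mathbb E[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j} v_j\|_2 \mid \mathcal F_{j-1}] \tag{4.45}\\
&= \sqrt2 \times \mathbb E_{\Omega_j}\mathbb E_{v_j}[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j} v_j\|_2] \tag{4.46}\\
&\le \sigma\sqrt2 \times \mathbb E_{\Omega_j}[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j}\|_F]. \tag{4.47}
\end{align}

Notice that because each column of $Q_{j-1}$ is orthogonal to the corresponding column of $A$, the diagonal elements of $A^* Q_{j-1}$ are zero. Under this condition, we can invoke a decoupling lemma given as Lemma E.1 to remove the first $P_{\Omega_j}$, giving

$$\mathbb E_{\Omega_j}[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j}\|_F] \le 16\sqrt{k/n}\;\mathbb E_{\Omega_j}[\|A^* Q_{j-1} P_{\Omega_j}\|_F] \le 16\,\|A\|\sqrt{k/n}\;\mathbb E_{\Omega_j}[\|Q_{j-1} P_{\Omega_j}\|_F].$$

Via incoherence, we can show that $\|A\| \le \sqrt{1 + n\mu(A)}$ (see (C.1)), and so

$$\mathbb E_{\Omega_j}[\|P_{\Omega_j} A^* Q_{j-1} P_{\Omega_j}\|_F] \le C_3\sqrt{k/n + k\mu(A)}\;\mathbb E_{\Omega_j}[\|Q_{j-1} P_{\Omega_j}\|_F]$$

for appropriate $C_3$. Combining bounds, we have shown that

$$\mathbb E[\langle Q_{j-1}, \lambda_j x_j^*\rangle \mid \mathcal F_{j-1}] \le \sigma\Big(2c_1c_2\,k\mu(A) + \frac{\sqrt2}{4}\,C_3\sqrt{k/n + k\mu(A)} - \frac{1}{4\sqrt\pi}\Big)\,\mathbb E_{\Omega_j}[\|Q_{j-1} P_{\Omega_j}\|_F]. \tag{4.48}$$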
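Assuming $k/n$ and $k\mu(A)$ are bounded below appropriately small constants, we have

$$\mathbb E[\langle Q_{j-1}, \lambda_j x_j^*\rangle \mid \mathcal F_{j-1}] \le -C_4\,\sigma\,\mathbb E_{\Omega_j}[\|Q_{j-1} P_{\Omega_j}\|_F] \le -C_4\,\sigma\,(k/n)\,\|Q_{j-1}\|_F = -C_1\sqrt{k/np}\;\|Q_{j-1}\|_F,$$

where we have used Jensen's inequality and the facts that $\mathbb E_{\Omega_j}[P_{\Omega_j}] = (k/n)I$ and $\sigma = \sqrt{n/kp}$. This establishes the first part of (4.27).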



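(iii) Bounding $\|\lambda_j x_j^*\|$. We next bound $\mathbb E[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal F_{j-1}]$. We have already shown that under the conditions of the lemma, $\|\lambda_j\|_2 \le c_5\sqrt k + 1/4 \le c_6\sqrt k$. So,

$$\|\Phi[\lambda_j x_j^*]\|_F^2 \le \|\lambda_j x_j^*\|_F^2 = \|\lambda_j\|_2^2\,\|x_j\|_2^2 \le c_6^2\,k\,\|x_j\|_2^2. \tag{4.49}$$

Since $\mathbb E[\|x_j\|_2^2] = n/p$, we have the simple bound

$$\mathbb E[\|\Phi[\lambda_j x_j^*]\|_F^2 \mid \mathcal F_{j-1}] \le c_6^2\,kn/p. \tag{4.50}$$

This establishes the second part of (4.27), completing the proof of Lemma 4.2. □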



5 Balancedness Property 

In this section, we show that for any $(\Delta_A, \Delta_X)$ in the tangent space to $\mathcal M$ at $(A, X)$,

$$\|\mathcal P_{\Omega^c}\Delta_X\|_F \ge \alpha\,\|\Delta_A\|_F \tag{5.1}$$

for appropriate $\alpha > 0$. This property essentially says that if we locally perturb the basis, we are guaranteed to pay some penalty, in terms of the norm of $\mathcal P_{\Omega^c}\Delta_X$. Hence, it can be viewed as a step in the direction of Theorem 2.1. By itself, it is not sufficient to establish Theorem 2.1, however, since it does not rule out the possibility that as $A$ changes, $\|\mathcal P_\Omega\Delta_X\|_1$ might decrease faster than $\|\mathcal P_{\Omega^c}\Delta_X\|_1$ increases - for this purpose we need the golfing scheme of the previous section. On a technical level, however, (5.1) makes the golfing scheme possible, by allowing us to open a "hole" around the constraint $\Phi[\Lambda X^*] = 0$, and construct dual certificates $\Lambda$ that only satisfy $\Phi[\Lambda X^*] \approx 0$. More precisely, we next show that

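Theorem 5.1. There exist numerical constants $C_1, \dots, C_8 > 0$ such that the following occurs. If

$$k \le C_1 \times \min\Big\{n,\ \frac{1}{\mu(A)}\Big\}, \tag{5.2}$$

then whenever $p \ge C_2 n^2$, with probability at least

$$1 - C_3\,p^{-4} - C_4\,n\exp\Big(-\frac{C_5\,p}{n\log p}\Big) - C_6\,n^2\exp\Big(-\frac{C_7\,k^2 p}{n^2}\Big), \tag{5.3}$$

all pairs $(\Delta_A, \Delta_X)$ satisfying (3.1) obey the estimate

$$\|\mathcal P_{\Omega^c}\Delta_X\|_F \ge C_8\,\|\Delta_A\|_F/\|A\|^2. \tag{5.4}$$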
Organization. The proof of Theorem 5.1 contains essentially two parts: algebraic manipulations that show that the desired property holds whenever the random matrix $X$ satisfies two particular properties, and then probabilistic reasoning to show that these properties hold with the stated probability. The first property, stated in Lemma 5.2, simply involves a bound on the extreme eigenvalues of $XX^*$. This lemma is proved in Appendix F. The second probabilistic property involves controlling the difference between a certain operator and its large sample limit. The quantities involved will arise naturally in the proof of Theorem 5.1, and the claim will be formally stated in Lemma 5.3 below. The proof of Lemma 5.3 is a bit technical, requiring us to apply the matrix Chernoff bound conditional on $\Omega$. This proof is given in Section 6.

5.1 Proof of Theorem 5.1

Before commencing the proof of Theorem 5.1, we introduce one additional definition. Fix $0 < t < 1/2$, and let $\mathcal E_{eig}(t)$ denote the event:

$$\mathcal E_{eig}(t) = \{\omega \mid \|XX^* - I\| < t\}. \tag{5.5}$$

In particular, on $\mathcal E_{eig}$, $\|XX^*\| \le 1 + t < 2$ and $\|(XX^*)^{-1}\| = (\lambda_{\min}(XX^*))^{-1} \le 1/(1-t) < 2$. It should not be particularly surprising that this event is highly likely. The matrix $X$ has iid columns, and it is easy to see that $\mathbb E[XX^*] = I$. The following lemma shows $XX^*$ is also close to $I$ in the operator norm, with high probability:

Lemma 5.2. Fix any $0 < t < 1/2$, and let $\mathcal E_{eig}(t)$ denote the event that the following bound holds:

$$\|XX^* - I\| < t. \tag{5.6}$$
Then there exist numerical constants Ci,C2,C3 all strictly positive such that for all p > Ci(n/i)^/*, 



'[£eig{t)] > 1 - C2ri exp 



nlogp 



P 



(5.7) 



Lemma 5.2 is essentially a consequence of the matrix Chernoff bound of [Tro10]. Its proof is a bit technical, and so is delayed to Section F of the appendix. For now, we take this result as given and commence the proof of Theorem 5.1.



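Proof of Theorem 5.1. On the event $\mathcal E_{eig}$, $XX^*$ is invertible, and any pair $(\Delta_A, \Delta_X)$ satisfying (3.1) also satisfies

$$\Delta_A = -A\Delta_X X^*(XX^*)^{-1}. \tag{5.8}$$

Hence, using that³

$$\|X^*(XX^*)^{-1}\| = 1/\sqrt{\lambda_{\min}(XX^*)},$$

and that for any matrices $P$, $Q$, $R$,

$$\|PQR\|_F \le \|P\|\,\|R\|\,\|Q\|_F,$$

on $\mathcal E_{eig}(1/2)$ we have

$$\|\Delta_A\|_F \le \frac{\|A\|\,\|\Delta_X\|_F}{\sqrt{\lambda_{\min}(XX^*)}} \le \sqrt2\,\|A\|\,\|\Delta_X\|_F. \tag{5.9}$$

We next show that for any pair $(\Delta_A, \Delta_X)$ satisfying (3.1), $\Delta_X$ cannot be too concentrated on $\Omega$. More precisely, we will show $\exists\,\alpha' < \infty$ such that for any such $\Delta_X$,

$$\|\mathcal P_\Omega[\Delta_X]\|_F \le \alpha'\,\|\mathcal P_{\Omega^c}[\Delta_X]\|_F. \tag{5.10}$$

³ This can be shown via the singular value decomposition of $X$.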
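Both facts used to obtain (5.9) can be checked numerically. The sketch below is illustrative: it draws a random full-row-rank $X$ (an assumption, not the paper's sparse model), compares $\|X^*(XX^*)^{-1}\|$ with $1/\sqrt{\lambda_{\min}(XX^*)}$, and tests the Frobenius bound on random factors. The function name is hypothetical.

```python
import numpy as np

def pinv_norm_check(n=6, p=40, seed=1):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p)) / np.sqrt(p)      # X X^T is close to I for large p
    gram = X @ X.T
    lhs = np.linalg.norm(X.T @ np.linalg.inv(gram), 2)   # spectral norm of X^*(XX^*)^{-1}
    rhs = 1.0 / np.sqrt(np.linalg.eigvalsh(gram)[0])     # eigvalsh sorts ascending
    # ||P Q R||_F <= ||P|| ||R|| ||Q||_F on random factors of compatible size
    P = rng.standard_normal((4, 5))
    Q = rng.standard_normal((5, 6))
    R = rng.standard_normal((6, 3))
    frob_ok = np.linalg.norm(P @ Q @ R, "fro") <= (
        np.linalg.norm(P, 2) * np.linalg.norm(R, 2) * np.linalg.norm(Q, "fro") + 1e-12
    )
    return lhs, rhs, bool(frob_ok)
```

The equality follows from the singular value decomposition: if $X = U S V^*$, then $X^*(XX^*)^{-1} = V S^{-1} U^*$, whose largest singular value is the reciprocal of the smallest singular value of $X$.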
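On $\mathcal E_{eig}$, the inverse in (5.8) is justified, and (5.8) holds. We can plug this relationship into the tangent space constraint $\Delta_A X + A\Delta_X = 0$, giving

$$0 = \Delta_A X + A\Delta_X = -A\Delta_X X^*(XX^*)^{-1}X + A\Delta_X = A\Delta_X\big(I - X^*(XX^*)^{-1}X\big).$$

Above, $P_X \doteq X^*(XX^*)^{-1}X$ is the projection matrix onto the range of $X^*$. We have one further constraint $A_i^*\Delta_A e_i = 0\ \forall i$. We introduce a more concise notation for this constraint, by letting $\mathcal C_A : \mathbb R^n \to \mathbb R^{m\times n}$ via

$$\mathcal C_A[z] = A\operatorname{diag}(z). \tag{5.11}$$

For $U = [u_1 \mid u_2 \mid \dots \mid u_n] \in \mathbb R^{m\times n}$, the action of the adjoint of $\mathcal C_A$ is given by

$$\mathcal C_A^*[U] = [\langle A_1, u_1\rangle, \dots, \langle A_n, u_n\rangle]^* \in \mathbb R^n. \tag{5.12}$$

Hence, our second constraint can be expressed concisely via $\mathcal C_A^*[\Delta_A] = 0 \in \mathbb R^n$. On $\mathcal E_{eig}$, any $\Delta_X$ participating in a pair $(\Delta_A, \Delta_X) \in T_{(A,X)}\mathcal M$ must satisfy

$$A\Delta_X(I - P_X) = 0 \quad\text{and}\quad \mathcal C_A^*[A\Delta_X X^*(XX^*)^{-1}] = 0. \tag{5.13}$$

It is convenient to temporarily express the constraint (5.13) in vector form, as a constraint on $\delta_x = \operatorname{vec}[\Delta_X] \in \mathbb R^{np}$. In vector notation, (5.13) is equivalent to $M\delta_x = 0$, with

$$M = \begin{bmatrix} (I - P_X) \otimes A \\ \mathbf C_A^*\big((XX^*)^{-1}X \otimes A\big) \end{bmatrix} \in \mathbb R^{(mp+n)\times np}. \tag{5.14}$$

In forming $M$, we have used the familiar identity $\operatorname{vec}[QRS] = (S^* \otimes Q)\operatorname{vec}[R]$, for matrices $Q$, $R$, and $S$ of compatible size. We have used $\mathbf C_A$ to denote the matrix version of the operator $\mathcal C_A$, uniquely defined via⁴

$$\operatorname{vec}[\mathcal C_A[z]] = \mathbf C_A z \quad \forall z \in \mathbb R^n. \tag{5.15}$$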
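The vectorization identities above are easy to get wrong by a transpose or an index ordering, so a small numerical check may help. The following sketch assumes column-stacking vectorization and real matrices; `vec`, `c_matrix`, and `vec_identities_gap` are illustrative names, not notation from the paper.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization."""
    return M.reshape(-1, order="F")

def c_matrix(A):
    """Matrix C with vec(A diag(z)) = C z: block diagonal, b-th block is the column A_b."""
    m, n = A.shape
    C = np.zeros((m * n, n))
    for b in range(n):
        C[b * m:(b + 1) * m, b] = A[:, b]
    return C

def vec_identities_gap(seed=2):
    rng = np.random.default_rng(seed)
    Q = rng.standard_normal((3, 4))
    R = rng.standard_normal((4, 5))
    S = rng.standard_normal((5, 2))
    # vec[QRS] = (S^T kron Q) vec[R]
    gap_kron = np.max(np.abs(vec(Q @ R @ S) - np.kron(S.T, Q) @ vec(R)))
    A = rng.standard_normal((4, 3))
    z = rng.standard_normal(3)
    U = rng.standard_normal((4, 3))
    C = c_matrix(A)
    # vec[A diag(z)] = C z
    gap_c = np.max(np.abs(vec(A @ np.diag(z)) - C @ z))
    # adjoint: C^T vec[U] lists the inner products <A_b, u_b>
    gap_adj = np.max(np.abs(C.T @ vec(U) - np.sum(A * U, axis=0)))
    return gap_kron, gap_c, gap_adj
```

The block structure of `c_matrix` is the one described in the footnote to (5.15): an $mn \times n$ block diagonal matrix whose blocks are the columns of $A$.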
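It will be easier to work with a symmetric variant of the equation $M\delta_x = 0$. Set

$$T = M^*M = (I - P_X) \otimes A^*A + \big(X^*(XX^*)^{-1} \otimes A^*\big)\,\mathbf C_A\mathbf C_A^*\,\big((XX^*)^{-1}X \otimes A\big), \tag{5.16}$$

then $M\delta_x = 0$ if and only if

$$T\delta_x = 0. \tag{5.17}$$

Splitting $\delta_x$ as $\delta_x = P_\Omega\delta_x + P_{\Omega^c}\delta_x$, and multiplying (5.17) on the left by $P_\Omega$ gives

$$P_\Omega T P_\Omega\,\delta_x = -P_\Omega T P_{\Omega^c}\,\delta_x, \tag{5.18}$$

that is,

$$[P_\Omega T P_\Omega](P_\Omega\delta_x) = -[P_\Omega T P_{\Omega^c}](P_{\Omega^c}\delta_x). \tag{5.19}$$

Now, although the matrix $P_\Omega T P_\Omega$ is rank-deficient, as we will see, its nullspace does not contain any vectors $z$ supported on $\Omega$. More quantitatively, let $S_\Omega \subset \mathbb R^{np}$ denote the subspace of vectors whose support is contained in $\Omega$ (i.e., the solution space of $P_\Omega z = z$), and define

$$\xi = \inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega T P_\Omega z\|}{\|z\|}. \tag{5.20}$$

Then if $\xi > 0$, by (5.19) we have

$$\|P_\Omega\delta_x\| \le \xi^{-1}\|P_\Omega T P_\Omega\,\delta_x\| = \xi^{-1}\|[P_\Omega T P_{\Omega^c}]P_{\Omega^c}\delta_x\| \le \xi^{-1}\|P_\Omega T P_{\Omega^c}\|\,\|P_{\Omega^c}\delta_x\|.$$

⁴ In particular, it is not difficult to see that $\mathbf C_A \in \mathbb R^{mn\times n}$ is a block diagonal matrix whose blocks are the columns of $A$.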
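A calculation⁵ shows that $\|P_\Omega T P_{\Omega^c}\| \le C\|A\|$, and hence, thus far we have shown

\begin{align}
\|\Delta_A\|_F &\le \sqrt2\,\|A\|\,\|\Delta_X\|_F \le \sqrt2\,\|A\|\big(\|\mathcal P_\Omega\Delta_X\|_F + \|\mathcal P_{\Omega^c}\Delta_X\|_F\big) \notag\\
&\le \sqrt2\,\|A\|\big(1 + C\xi^{-1}\|A\|\big)\|\mathcal P_{\Omega^c}\Delta_X\|_F. \tag{5.21}
\end{align}

Our only remaining tasks are to lower bound $\xi$ to complete the bound on $\alpha$, and then verify that the failure probability is indeed small. We carry out these tasks below, with some technical details associated with bounding $\xi$ delayed to Section 6.

The expression for $T$ in (5.16) is quite complicated. Notice, however, that as $p \to \infty$, $XX^* \to I$ almost surely. If we can replace $(XX^*)^{-1}$ with $I$ in (5.16), the expression will simplify significantly. We introduce $\hat T$, this simplified approximation, given by

\begin{align}
\hat T &= (I - X^*X) \otimes A^*A + (X^* \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,(X \otimes A) \notag\\
&= I \otimes A^*A - (X^* \otimes A^*)(I - \mathbf C_A\mathbf C_A^*)(X \otimes A). \tag{5.22}
\end{align}

The matrix $I \otimes A^*A$ is quite simple: for a matrix $Z$, $(I \otimes A^*A)\operatorname{vec}[Z] = \operatorname{vec}[(A^*A)Z]$, and so $(A^*A)$ simply acts columnwise on $Z$. We will see that if the columns of $Z$ are appropriately sparse, then because $A$ is incoherent, $A^*A \approx I$ will approximately preserve their norms. Hence, the restricted singular value $\xi$ associated with the matrix $I \otimes A^*A$ is well-behaved. We therefore let $R$ denote the nuisance term in (5.22)

$$R = (X^* \otimes A^*)(I - \mathbf C_A\mathbf C_A^*)(X \otimes A), \tag{5.23}$$

so that we have $\hat T = I \otimes A^*A - R$, and

$$T = I \otimes A^*A - R + (T - \hat T). \tag{5.24}$$

In terms of these variables,

\begin{align}
\xi &= \inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A - R + T - \hat T)P_\Omega z\|}{\|z\|} \notag\\
&\ge \inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A)P_\Omega z\|}{\|z\|} - \sup_{z \ne 0} \frac{\|P_\Omega R P_\Omega z\|}{\|z\|} - \sup_{z \ne 0} \frac{\|P_\Omega(T - \hat T)P_\Omega z\|}{\|z\|} \notag\\
&= \inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A)P_\Omega z\|}{\|z\|} - \|P_\Omega R P_\Omega\| - \|P_\Omega(T - \hat T)P_\Omega\|. \tag{5.25}
\end{align}

The first and third terms above require relatively little manipulation to control. In particular, in paragraph (i) below, we will show that

$$\inf_{z \in S_\Omega\setminus\{0\}} \frac{\|P_\Omega(I \otimes A^*A)P_\Omega z\|}{\|z\|} \ge 1 - k\mu(A). \tag{5.26}$$

In paragraph (ii) below, we will show that there is a constant $t_\star > 0$ such that on $\mathcal E_{eig}(t_\star)$,

$$\|P_\Omega(T - \hat T)P_\Omega\| \le 1/8. \tag{5.27}$$

⁵ Notice that $\|P_\Omega T P_{\Omega^c}\| \le \|P_\Omega T\|\,\|P_{\Omega^c}\| = \|P_\Omega T\|$. Using (5.16), write

\begin{align*}
\|P_\Omega T\| &\le \|P_\Omega(I \otimes A^*)\|\;\big\|(I - P_X) \otimes A + (X^*(XX^*)^{-1} \otimes I)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1}X \otimes A)\big\|\\
&\le \|P_\Omega(I \otimes A^*)\|\;\big\|(I - P_X) \otimes I + (X^*(XX^*)^{-1} \otimes I)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1}X \otimes I)\big\|\;\|A\|\\
&\le \|P_\Omega(I \otimes A^*)\|\,\big(1 + 1/\lambda_{\min}(XX^*)\big)\,\|A\|.
\end{align*}

Now, $P_\Omega(I \otimes A^*)$ is a block-diagonal matrix, with blocks given by $A_{\Omega_1}^*, \dots, A_{\Omega_p}^*$. By (C.2), the operator norm of each of these blocks is bounded by a constant, say, $c_1$. Hence, $\|P_\Omega(I \otimes A^*)\|$ is bounded by $c_1$ as well. Similarly, on $\mathcal E_{eig}$, $\lambda_{\min}(XX^*)$ is bounded below by a constant, giving the desired expression.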
The analysis of $P_\Omega R P_\Omega$ is a bit trickier, requiring both additional algebraic manipulations and additional probability estimates. For now, we will define an event $\mathcal E_R$, on which the norm of this term is small:

$$\mathcal E_R = \{\omega \mid \|P_\Omega R P_\Omega\| \le 1/8\}. \tag{5.28}$$

In Section 6 we prove the following lemma, which shows that $\mathcal E_R$ is indeed likely:

Lemma 5.3. Let $\mathcal E_R$ be the event that $\|P_\Omega R P_\Omega\| \le 1/8$. Then there exist positive numerical constants $C_1, \dots, C_6$ such that whenever

$$k \le \min\Big\{C_1 n,\ \frac{C_2}{\mu(A)}\Big\} \tag{5.29}$$

and $p \ge C_3 n^2$ we have

$$\mathbb P[\mathcal E_R] \ge 1 - C_4\,p^{-4} - C_5\,n^2\exp\big(-C_6\,k^2p/n^2\big). \tag{5.30}$$

Plugging (5.26), (5.27) and (5.28) into (5.25), we obtain that on $\mathcal E_{eig}(t_\star) \cap \mathcal E_R$

$$\xi \ge 1 - k\mu(A) - \tfrac18 - \tfrac18 = \tfrac34 - k\mu(A). \tag{5.31}$$

So, assuming $C_1$ in the statement of Theorem 5.1 is such that $k\mu(A) < 1/2$, we have $\xi \ge 1/4$. Plugging this value for $\xi$ into (5.21), we finally obtain that on $\mathcal E_{eig}(t_\star) \cap \mathcal E_R$, for any pair $(\Delta_A, \Delta_X)$ in the tangent space

$$\|\Delta_A\|_F \le \sqrt2\,\|A\|\,(1 + C'\|A\|)\,\|\mathcal P_{\Omega^c}[\Delta_X]\|_F \le C''\,\|A\|^2\,\|\mathcal P_{\Omega^c}[\Delta_X]\|_F. \tag{5.32}$$

Hence, the bound claimed in Theorem 5.1 holds with probability at least $1 - \mathbb P[\mathcal E_{eig}(t_\star)^c] - \mathbb P[\mathcal E_R^c]$. When the constants $C_1$ and $C_2$ in Theorem 5.1 are chosen appropriately, the conditions of Lemma 5.2 (proved in Section F) and Lemma 5.3 (proved in Section 6) are verified. From Lemma 5.2,



V[8eig{ti,y] < cm exp(-c2p/n log(p)) +p 
In Lemma 5.3 we show that

$$\mathbb P[\mathcal E_R^c] \le C_3\,p^{-4} + C_4\,n^2\exp\big(-C_5\,k^2p/n^2\big).$$

Combining the probabilities and consolidating polynomial terms establishes the desired result. It remains only to show that the two bounds in (5.26) and (5.27) indeed hold.



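(i) Establishing (5.26). For this term, it is more convenient to work with matrices and the Frobenius norm, rather than vectors and the $\ell^2$ norm. Fix any $Z \in \mathbb R^{n\times p}$ with $z = \operatorname{vec}[Z] \in S_\Omega$ (i.e., $Z$ has support contained in $\Omega$). Then

\begin{align}
\|P_\Omega(I \otimes A^*A)P_\Omega z\|^2 &= \|\mathcal P_\Omega[A^*A\,\mathcal P_\Omega[Z]]\|_F^2 = \sum_{j=1}^{p} \|A_{\Omega_j}^* A_{\Omega_j} Z(\Omega_j, j)\|_2^2 \notag\\
&\ge \sum_{j=1}^{p} \sigma_{\min}^2(A_{\Omega_j}^* A_{\Omega_j})\,\|Z(\Omega_j, j)\|_2^2 \ge \|Z\|_F^2\,(1 - k\mu(A))^2, \tag{5.33}
\end{align}

where in the final step we have used that $Z$ is supported on $\Omega$, together with the bound $\sigma_{\min}(A_{\Omega_j}^* A_{\Omega_j}) \ge 1 - k\mu(A)$, shown in (C.3) of Appendix C. The bound (5.33) holds for all $z$ supported on $\Omega$, and so (5.26) holds.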
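The restricted eigenvalue bound $\sigma_{\min}(A_{\Omega_j}^*A_{\Omega_j}) \ge 1 - k\mu(A)$ can be checked by brute force on a small example. The sketch below is illustrative only: it assumes the same kind of dictionary as an identity block next to a normalized Hadamard block, and enumerates every support of size $k$. The helper names are hypothetical.

```python
import numpy as np
from itertools import combinations

def small_dictionary(m=8):
    """Assumed test dictionary [I | H/sqrt(m)] with unit-norm columns (m a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return np.hstack([np.eye(m), H / np.sqrt(m)])

def coherence(A):
    """Largest absolute inner product between distinct unit-norm columns."""
    G = A.T @ A
    return np.max(np.abs(G - np.eye(A.shape[1])))

def min_restricted_eigenvalue(A, k):
    """Smallest eigenvalue of A_S^T A_S over all supports S of size k (brute force)."""
    n = A.shape[1]
    return min(
        np.linalg.eigvalsh(A[:, list(S)].T @ A[:, list(S)])[0]
        for S in combinations(range(n), k)
    )
```

For example, `min_restricted_eigenvalue(small_dictionary(), 2)` can be compared against `1 - 2 * coherence(small_dictionary())`. The bound is a Gershgorin-type estimate, so it is typically loose, but it is all that (5.33) needs.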
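(ii) Establishing (5.27). For the term $P_\Omega(T - \hat T)P_\Omega$, write $\Xi = (XX^*)^{-1} - I$, and notice that $T - \hat T$ can be written as

\begin{align*}
T - \hat T &= X^*X \otimes A^*A - (X^*(XX^*)^{-1}X) \otimes A^*A\\
&\quad + (X^*(XX^*)^{-1} \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1}X \otimes A) - (X^* \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,(X \otimes A)\\
&= (X^* \otimes A^*)(-\Xi \otimes I)(X \otimes A)\\
&\quad + (X^*(XX^*)^{-1} \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1}X \otimes A) - (X^* \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1}X \otimes A)\\
&\quad + (X^* \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1}X \otimes A) - (X^* \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,(X \otimes A)\\
&= (X^* \otimes A^*)(-\Xi \otimes I)(X \otimes A)\\
&\quad + (X^* \otimes A^*)(\Xi \otimes I)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1}X \otimes A) + (X^* \otimes A^*)\,\mathbf C_A\mathbf C_A^*\,(\Xi \otimes I)(X \otimes A)\\
&= (X^* \otimes A^*)\Big((\mathbf C_A\mathbf C_A^* - I)(\Xi \otimes I) + (\Xi \otimes I)\,\mathbf C_A\mathbf C_A^*\,((XX^*)^{-1} \otimes I)\Big)(X \otimes A).
\end{align*}

Therefore, using $\|\mathbf C_A\mathbf C_A^* - I\| = 1$ and $\|\mathbf C_A\| = 1$, we have the estimate

\begin{align}
\|P_\Omega(T - \hat T)P_\Omega\| &\le \|P_\Omega(X^* \otimes A^*)\|^2 \times \big(\|\Xi\| + \|(XX^*)^{-1}\|\,\|\Xi\|\big) \notag\\
&\le \|P_\Omega(I \otimes A^*)\|^2\,\|X\|^2\,\big(1 + \|(XX^*)^{-1}\|\big)\,\|\Xi\| \notag\\
&\le 6 \times \|P_\Omega(I \otimes A^*)\|^2\,\|\Xi\|, \tag{5.34}
\end{align}

where the last bound holds on $\mathcal E_{eig}(t)$ for small enough $t$ (e.g., $t < 1/2$ is sufficient). From the incoherence of $A$ (i.e., (C.2)),

$$\|P_\Omega(I \otimes A^*)\|^2 = \max_j \|A_{\Omega_j}\|^2 \le 1 + k\mu(A) < 2. \tag{5.35}$$

Hence, on the event $\mathcal E_{eig}$, $\|P_\Omega(T - \hat T)P_\Omega\| \le 12\,\|\Xi\|$. Finally, on $\mathcal E_{eig}(t)$, $\|\Xi\| \le t/(1-t)$; choosing $t$ small enough guarantees that on $\mathcal E_{eig}(t)$, $\|P_\Omega(T - \hat T)P_\Omega\| \le 1/8$ as desired (in particular, $t \le 1/97$ suffices).

Thus, (5.26) and (5.27) hold, and Theorem 5.1 is established. □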
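The factorization of $T - \hat T$ derived in paragraph (ii) involves several Kronecker manipulations, so it is worth confirming numerically. The following sketch builds $T$, $\hat T$ and the factored right-hand side for small random matrices and reports the largest entrywise discrepancy. It assumes real matrices, unit-norm columns of $A$, and a dense Gaussian $X$ rather than the sparse model; the function name is illustrative.

```python
import numpy as np

def t_minus_that_gap(m=4, n=3, p=12, seed=3):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)                    # unit-norm columns
    X = rng.standard_normal((n, p)) / np.sqrt(p)
    G = np.linalg.inv(X @ X.T)                        # (X X^*)^{-1}
    Xi = G - np.eye(n)
    C = np.zeros((m * n, n))                          # matrix form of z -> A diag(z)
    for b in range(n):
        C[b * m:(b + 1) * m, b] = A[:, b]
    CC = C @ C.T
    I_m, I_p, I_mn = np.eye(m), np.eye(p), np.eye(m * n)
    P_X = X.T @ G @ X
    T = np.kron(I_p - P_X, A.T @ A) + np.kron(X.T @ G, A.T) @ CC @ np.kron(G @ X, A)
    T_hat = np.kron(I_p - X.T @ X, A.T @ A) + np.kron(X.T, A.T) @ CC @ np.kron(X, A)
    E = np.kron(Xi, I_m)
    middle = (CC - I_mn) @ E + E @ CC @ np.kron(G, I_m)
    rhs = np.kron(X.T, A.T) @ middle @ np.kron(X, A)
    gap = np.max(np.abs((T - T_hat) - rhs))
    return gap, np.linalg.norm(CC - I_mn, 2), np.linalg.norm(CC, 2)
```

The second and third return values are the two operator norms used just before (5.34). Both equal one here because, with unit-norm columns, $\mathbf C_A\mathbf C_A^*$ is block diagonal with rank-one orthogonal projections as its blocks.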



6 Controlling the residual $P_\Omega R P_\Omega$

In this section, we estimate the norm of the residual $P_\Omega R P_\Omega$, where $R$ was defined in (5.23), and show that with high probability it is bounded by a small constant. To establish this result, in Section 6.1 below we first develop a more convenient expression for $P_\Omega R P_\Omega$ as a sum of random semidefinite matrices that are independent conditioned on $\Omega$.

6.1 Proof of Lemma 5.3

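Proof. We begin by introducing an additional bit of notation. For $X \in \mathbb R^{n\times p}$ we write

$$x^i = e_i^* X \in \mathbb R^{1\times p} \tag{6.1}$$

for the $i$-th row of $X$, and

$$x_j = X e_j \in \mathbb R^{n} \tag{6.2}$$

for the $j$-th column of $X$. Similarly, we let

$$\Omega^i = \{j \mid (i, j) \in \Omega\} \subset [p] \tag{6.3}$$

and

$$\Omega_j = \{i \mid (i, j) \in \Omega\} \subset [n]. \tag{6.4}$$

Recalling the definition (5.23) and using the familiar identity $(P \otimes Q)(R \otimes S) = (PR) \otimes (QS)$, we have

$$R = (X^* \otimes I)(I \otimes A^*)(I - \mathbf C_A\mathbf C_A^*)(I \otimes A)(X \otimes I). \tag{6.5}$$

The product of the middle three terms is a block diagonal matrix

$$(I \otimes A^*)(I - \mathbf C_A\mathbf C_A^*)(I \otimes A) = \begin{bmatrix} A^*(I - A_1A_1^*)A & & \\ & \ddots & \\ & & A^*(I - A_nA_n^*)A \end{bmatrix}. \tag{6.6}$$

For compactness, let $P_i = I - A_iA_i^*$; notice that this is the projection matrix onto the orthogonal complement of $A_i$. Then, expanding the product in (6.5) more explicitly, we have

$$R = \begin{bmatrix} \sum_{b=1}^{n} X_{b,1}X_{b,1}\,A^*P_bA & \cdots & \sum_{b=1}^{n} X_{b,1}X_{b,p}\,A^*P_bA \\ \vdots & \ddots & \vdots \\ \sum_{b=1}^{n} X_{b,p}X_{b,1}\,A^*P_bA & \cdots & \sum_{b=1}^{n} X_{b,p}X_{b,p}\,A^*P_bA \end{bmatrix}. \tag{6.7}$$

Breaking this sum up into $n$ terms (indexed by common $b$), we have

$$P_\Omega R P_\Omega = \sum_{b=1}^{n} P_\Omega\big[x^{b*}x^b \otimes A^*P_bA\big]P_\Omega. \tag{6.8}$$

If we set

\begin{align}
\Psi_i &= P_\Omega\big[x^{i*}x^i \otimes A^*P_iA\big]P_\Omega \notag\\
&= P_\Omega\big(P_{\Omega^i}v^{i*}v^iP_{\Omega^i} \otimes A^*P_iA\big)P_\Omega, \tag{6.9}
\end{align}

then we have

$$P_\Omega R P_\Omega = \sum_{i=1}^{n} \Psi_i. \tag{6.10}$$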
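The row-wise decomposition of $R$ can be verified directly before the projections $P_\Omega$ are applied. The sketch below is illustrative: it assumes unit-norm columns of $A$ and a dense Gaussian $X$, omits the support projection, and uses hypothetical function names. It compares $R$ from (5.23) with the sum of the $n$ Kronecker terms, and checks that each term is positive semidefinite.

```python
import numpy as np

def residual_terms(A, X):
    """Return R = (X^T kron A^T)(I - C C^T)(X kron A) and the n summands x^{b*}x^b kron A^T P_b A."""
    m, n = A.shape
    C = np.zeros((m * n, n))
    for b in range(n):
        C[b * m:(b + 1) * m, b] = A[:, b]
    R = np.kron(X.T, A.T) @ (np.eye(m * n) - C @ C.T) @ np.kron(X, A)
    terms = []
    for b in range(n):
        P_b = np.eye(m) - np.outer(A[:, b], A[:, b])   # projector onto the complement of A_b
        terms.append(np.kron(np.outer(X[b], X[b]), A.T @ P_b @ A))
    return R, terms

def residual_decomposition_gap(m=4, n=3, p=5, seed=7):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A, axis=0)                     # unit-norm columns
    X = rng.standard_normal((n, p))
    R, terms = residual_terms(A, X)
    gap = np.max(np.abs(R - sum(terms)))
    min_eig = min(np.linalg.eigvalsh(t)[0] for t in terms)
    return gap, min_eig
```

Each summand is a Kronecker product of two positive semidefinite matrices, which is why the sum in (6.10) is amenable to a matrix Chernoff bound once the support has been fixed.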



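This is a sum of random positive semidefinite matrices. Moreover, from (6.9), we observe that conditioned on $\Omega$, the $\Psi_i$ are fixed functions of independent random vectors $v^i$, and hence the $\Psi_i$ are conditionally independent. We would like to apply a matrix tail bound to this sum, conditioned on $\Omega$. To do this, we first need to understand how the support affects the expected size of $\Psi_i$.

With high probability, the support $\Omega$ is quite regular. If we fix any $i \in [n]$, the expected size of $\Omega^i$ is simply $pk/n$. In fact, because the $\Omega_j$ are independent (and hence the events $j \in \Omega^i$ are independent), $|\Omega^i|$ concentrates near this value. Moreover, if $i$ and $i'$ are distinct, $|\Omega^i \cap \Omega^{i'}|$ concentrates about its expectation, which is bounded by $k^2p/n^2$. We define a set of "desirable" supports, for which these quantities do not greatly exceed their expectations:

$$\mathcal O = \Big\{\Omega \subset [n] \times [p] \;\Big|\; \max_{i=1,\dots,n} |\Omega^i| \le \frac{3pk}{2n},\ \ \max_{i \ne i'} |\Omega^i \cap \Omega^{i'}| \le \frac{3pk^2}{2n^2}\Big\}. \tag{6.11}$$

It is not difficult to show that the event $\Omega \in \mathcal O$ is overwhelmingly likely:

Lemma 6.1. With overwhelming probability, $\Omega \in \mathcal O$:

$$\mathbb P[\Omega \in \mathcal O] \ge 1 - n^2\exp\Big(-\frac{k^2p}{10n^2}\Big). \tag{6.12}$$

We prove Lemma 6.1 in Section 6.2 below.

Now, when $\Omega \in \mathcal O$, the norms of the rows of $X$ also concentrate about their conditional expectations. We define $n$ events, on which the rows $x^i$ are not too large in norm, and also do not concentrate too strongly on the intersection $\Omega^i \cap \Omega^{i'}$ for any $i' \ne i$:

$$\mathcal E_i = \Big\{\omega \;\Big|\; \max_{i' \ne i} \|x^iP_{\Omega^{i'}}\| \le 2\sqrt{k/n},\ \text{and}\ \|x^i\| \le 2\Big\}. \tag{6.13}$$

We further set

$$\mathcal E_X = \bigcap_{i=1}^{n} \mathcal E_i. \tag{6.14}$$

It is not hard to show that $\mathcal E_X$ is overwhelmingly likely:

Lemma 6.2. For any $\Omega \in \mathcal O$,

$$\mathbb P[\mathcal E_X \mid \Omega] \ge 1 - n^2\exp\Big(-\frac{k^2p}{4n^2}\Big). \tag{6.15}$$

We prove Lemma 6.2 in Section 6.3 below. This lemma is useful because whenever $\mathcal E_i$ occurs, $\Psi_i$ is indeed small in norm:

Lemma 6.3. Let $\mathcal E_i$ be the event defined in (6.13), and let $\Psi_i$ denote the $i$-th residual term:

$$\Psi_i = P_\Omega\big[x^{i*}x^i \otimes A^*P_iA\big]P_\Omega. \tag{6.16}$$

Then on event $\mathcal E_i$, we have

$$\|\Psi_i\| \le 4k/n + 24k\mu(A). \tag{6.17}$$

We prove Lemma 6.3 in Section 6.4 below. For now, however, we show how the previous three lemmas, together with a matrix Chernoff bound, imply the desired result. For convenience, let $\Psi = P_\Omega R P_\Omega = \sum_i \Psi_i$. Set

$$\tilde\Psi_i = \Psi_i \times 1_{\mathcal E_i}, \tag{6.18}$$

where $1_{\mathcal E_i}$ denotes the indicator random variable for the event $\mathcal E_i$, and write $\tilde\Psi = \sum_i \tilde\Psi_i$. By Lemma 6.3, $\tilde\Psi_i$ always satisfies

$$\|\tilde\Psi_i\| \le 4k/n + 24k\mu(A) \doteq B. \tag{6.19}$$

Conditioned on $\Omega$, each $\tilde\Psi_i$ is a function of $v^i$ only, and hence the $\tilde\Psi_i$ are independent conditioned on $\Omega$.

We apply a sequence of manipulations to reduce the problem of bounding the probability that $\|\Psi\|$ exceeds $1/8$ to the problem of bounding the probability that $\|\tilde\Psi\|$ exceeds $1/8$:

\begin{align}
\mathbb P[\|\Psi\| > 1/8] &= \mathbb P[\|\Psi\| > 1/8 \mid \Omega \in \mathcal O]\,\mathbb P[\Omega \in \mathcal O] + \mathbb P[\|\Psi\| > 1/8 \mid \Omega \in \mathcal O^c]\,\mathbb P[\Omega \in \mathcal O^c] \notag\\
&\le \mathbb P[\|\Psi\| > 1/8 \mid \Omega \in \mathcal O] + \mathbb P[\Omega \in \mathcal O^c] \notag\\
&\le \max_{\Omega_0 \in \mathcal O} \mathbb P[\|\Psi\| > 1/8 \mid \Omega_0] + \mathbb P[\Omega \in \mathcal O^c] \notag\\
&\le \max_{\Omega_0 \in \mathcal O} \big\{\mathbb P[\|\tilde\Psi\| > 1/8 \mid \Omega_0] + \mathbb P[\tilde\Psi \ne \Psi \mid \Omega_0]\big\} + \mathbb P[\Omega \in \mathcal O^c] \notag\\
&\le \max_{\Omega_0 \in \mathcal O} \big\{\mathbb P[\|\tilde\Psi\| > 1/8 \mid \Omega_0] + \mathbb P[\cup_i \mathcal E_i^c \mid \Omega_0]\big\} + \mathbb P[\Omega \in \mathcal O^c] \tag{6.20}\\
&= \max_{\Omega_0 \in \mathcal O} \big\{\mathbb P[\|\tilde\Psi\| > 1/8 \mid \Omega_0] + \mathbb P[\mathcal E_X^c \mid \Omega_0]\big\} + \mathbb P[\Omega \in \mathcal O^c]. \tag{6.21}
\end{align}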
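In (6.20), we have used that by definition on $\mathcal E_i$, $\tilde\Psi_i = \Psi_i 1_{\mathcal E_i}$ is equal to $\Psi_i$, while in the following line we have used the definition of $\mathcal E_X = \cap_i \mathcal E_i$. Now, Lemma 6.2 shows that conditioned on any $\Omega_0 \in \mathcal O$, $\mathcal E_X^c$ is overwhelmingly unlikely, while Lemma 6.1 shows that the event $\Omega \in \mathcal O^c$ is overwhelmingly unlikely. Plugging in the bounds from those two lemmas, we have that

\begin{align}
\mathbb P[\|\Psi\| > 1/8] &\le \max_{\Omega_0 \in \mathcal O} \mathbb P[\|\tilde\Psi\| > 1/8 \mid \Omega_0] + n^2\exp\Big(-\frac{k^2p}{4n^2}\Big) + n^2\exp\Big(-\frac{k^2p}{10n^2}\Big) \notag\\
&\le \max_{\Omega_0 \in \mathcal O} \mathbb P[\|\tilde\Psi\| > 1/8 \mid \Omega_0] + 2n^2\exp\Big(-\frac{k^2p}{10n^2}\Big). \tag{6.22}
\end{align}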

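We complete the proof by applying the matrix Chernoff bound (B.4) to the first term. Fix any $\Omega_0 \in \mathcal O$. We need to estimate $\mu_{\max} = \|\mathbb E[\tilde\Psi \mid \Omega_0]\|$. Since $0 \preceq \tilde\Psi \preceq \Psi$ always,

$$\mu_{\max} = \|\mathbb E[\tilde\Psi \mid \Omega_0]\| \le \|\mathbb E[\Psi \mid \Omega_0]\|. \tag{6.23}$$

This conditional expectation can be easily evaluated using the expression for $\Psi_i$ in (6.9) and the fact that $\mathbb E[v^{i*}v^i] = (n/kp)\,I$:

$$\mathbb E[\Psi \mid \Omega] = \frac{n}{kp}\sum_{i=1}^{n} P_\Omega\big(P_{\Omega^i} \otimes A^*P_iA\big)P_\Omega.$$

Notice that for each $i$, $P_{\Omega^i} \otimes A^*P_iA \preceq P_{\Omega^i} \otimes A^*A$, and so

$$\mathbb E[\Psi \mid \Omega] \preceq \frac{n}{kp}\,P_\Omega\Big(\Big(\sum_{i=1}^{n} P_{\Omega^i}\Big) \otimes A^*A\Big)P_\Omega. \tag{6.24}$$

The matrix $\sum_i P_{\Omega^i}$ is diagonal; its $(j, j)$ element simply counts the number of nonzero entries $i$ in the $j$-th column of $\Omega$. This number is a constant $k$, so $\sum_i P_{\Omega^i} = kI$, and

$$\mathbb E[\Psi \mid \Omega] \preceq \frac{n}{p}\,P_\Omega(I \otimes A^*A)P_\Omega. \tag{6.25}$$

The matrix $P_\Omega(I \otimes A^*A)P_\Omega$ is block-diagonal, with $j$-th block $P_{\Omega_j}A^*AP_{\Omega_j}$. This block has norm bounded by $\|A_{\Omega_j}\|^2$. Using a calculation given in (C.2), this is in turn bounded by $3/2$, provided $k\mu(A) < 1/2$. Hence, we have

$$\mu_{\max} \le 3n/2p. \tag{6.26}$$

We apply the matrix Chernoff bound (B.3) with $t\mu_{\max} = 1/8$, and hence $t \ge p/12n$, which gives

$$\mathbb P[\|\tilde\Psi\| > 1/8 \mid \Omega] \le np\,\Big(\frac{12en}{p}\Big)^{1/8B}, \tag{6.27}$$

where we recall $B$ is the bound on the norm of the summands $\tilde\Psi_i$. By choosing the constant $C_1 > 0$ in the statement of Lemma 5.3 appropriately, we can make the exponent $\nu = 1/8B$ as large as desired; the probability that $\|\tilde\Psi\|$ exceeds $1/8$ is bounded as

$$\mathbb P[\|\tilde\Psi\| > 1/8 \mid \Omega] \le c(\nu)\,n^{1+\nu}p^{1-\nu}. \tag{6.28}$$

Assuming $p > Cn^2$, and by appropriate choice of $\nu$, we can make the right hand side smaller than $C'p^{-4}$ (here, the exponent 4 is clearly arbitrary). Plugging into (6.22) completes the proof. □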
□ 
6.2 Proof of Lemma 6.1

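Proof. Notice that $|\Omega^i| = \sum_{j=1}^{p} 1_{(i,j)\in\Omega}$ is a sum of $p$ independent Bernoulli($k/n$) random variables. Write

$$Z_j = 1_{(i,j)\in\Omega} - \mathbb E[1_{(i,j)\in\Omega}] = 1_{(i,j)\in\Omega} - k/n.$$

Then $|\Omega^i| = pk/n + \sum_{j=1}^{p} Z_j$. The $Z_j$ are independent, zero mean, with magnitude bounded by 1 and variance

$$\mathbb E[Z_j^2] = \frac{k}{n}\Big(1 - \frac{k}{n}\Big) \le \frac{k}{n}. \tag{6.29}$$

By Bernstein's inequality, for any $\varepsilon > 0$,

$$\mathbb P\Big[\sum_{j=1}^{p} Z_j > \varepsilon p\Big] \le \exp\Big(-\frac{p\,\varepsilon^2}{2\,\mathbb E[Z_j^2] + 2\varepsilon/3}\Big) \le \exp\Big(-\frac{p\,\varepsilon^2}{2k/n + 2\varepsilon/3}\Big). \tag{6.30}$$

Setting $\varepsilon = k/2n$ and then taking a union bound over $i \in [n]$ gives

$$\mathbb P\Big[\max_i |\Omega^i| > \frac{3kp}{2n}\Big] \le n\exp\Big(-\frac{pk}{10n}\Big). \tag{6.31}$$

Above, we have used $8 + 4/3 < 10$ to simplify the constant. Similarly, notice that

$$|\Omega^i \cap \Omega^{i'}| = \sum_{j=1}^{p} H_j, \qquad H_j = 1_{(i,j)\in\Omega}\,1_{(i',j)\in\Omega},$$

is a sum of independent Bernoulli random variables $H_j$ which take on value one with probability

$$\mathbb E[H_j] = \mathbb P[H_j = 1] = \binom{n-2}{k-2}\Big/\binom{n}{k} = \frac{k(k-1)}{n(n-1)} \le \frac{k^2}{n^2}. \tag{6.32}$$

Set $Z_j = H_j - \mathbb E[H_j]$. Then, the bound $|Z_j| \le 1$ always holds, and furthermore

$$\mathbb E[Z_j^2] = \operatorname{Var}[H_j] \le \mathbb E[H_j^2] \le \frac{k^2}{n^2}. \tag{6.33}$$

With these definitions,

$$|\Omega^i \cap \Omega^{i'}| \le pk^2/n^2 + \sum_{j=1}^{p} Z_j.$$

Again applying Bernstein's inequality to $Z_j$, setting $\varepsilon = k^2/2n^2$ and taking a union bound over the $\binom n2$ choices of distinct $(i, i')$, we have

$$\mathbb P\Big[\max_{i \ne i'} |\Omega^i \cap \Omega^{i'}| > \frac{3pk^2}{2n^2}\Big] \le \binom n2\exp\Big(-\frac{pk^2}{10n^2}\Big). \tag{6.34}$$

Summing the failure probabilities in (6.31) and (6.34) (using that $\binom n2 + n \le n^2$ and that $\exp(-pk/10n) \le \exp(-pk^2/10n^2)$) completes the proof. □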
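A short simulation illustrates the regularity of the support claimed in Lemma 6.1. The sketch below is illustrative only: the parameter values are arbitrary choices, the function names are hypothetical, and a single random draw is of course not a proof. It samples the columns $\Omega_j$ as uniform $k$-subsets, then reports the largest row count $\max_i|\Omega^i|$ and the largest pairwise overlap $\max_{i\ne i'}|\Omega^i\cap\Omega^{i'}|$, alongside the exact Bernstein exponent obtained from (6.30) with $\varepsilon = k/2n$.

```python
import numpy as np

def support_statistics(n=20, k=4, p=2000, seed=4):
    """Sample Omega with uniform k-subset columns; return (max row count, max row overlap)."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, p), dtype=bool)
    for j in range(p):
        mask[rng.choice(n, size=k, replace=False), j] = True
    counts = mask.astype(int)
    row_sizes = counts.sum(axis=1)            # |Omega^i| for each row i
    overlaps = counts @ counts.T              # |Omega^i ∩ Omega^i'| off the diagonal
    np.fill_diagonal(overlaps, 0)
    return int(row_sizes.max()), int(overlaps.max())

def bernstein_exponent(n, k, p):
    """Exponent p*eps^2 / (2k/n + 2eps/3) from (6.30) evaluated at eps = k/(2n)."""
    eps = k / (2 * n)
    return p * eps ** 2 / (2 * k / n + 2 * eps / 3)
```

With $\varepsilon = k/2n$ the exponent simplifies to $3pk/(28n)$, which is at least $pk/(10n)$; this is the arithmetic behind the remark that $8 + 4/3 < 10$.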



6.3 Proof of Lemma 6.2

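Proof. This proof is an exercise in Gaussian measure concentration. Equation (2.35) of [Led01] implies that if $v$ is an iid sequence of $\mathcal{N}(0, \sigma^2)$ random variables, and $f$ is a positively homogeneous, 1-Lipschitz function, then
\[
\mathbb{P}\left[ f(v) > \mathbb{E}[f(v)] + t \right] \le \exp\left( -\frac{t^2}{2\sigma^2} \right). \tag{6.35}
\]
Now, with $\Omega$ fixed, $\|x^i\| = \|v^i P_{\Omega^i}\| = f(v^i)$ is a 1-Lipschitz function of the iid $\mathcal{N}(0, n/kp)$ vector $v^i$. Since for any $\Omega \in \mathcal{O}$, $|\Omega^i| \le 3pk/2n$,
\[
\mathbb{E}[\|x^i\| \mid \Omega] \le \sqrt{\mathbb{E}[\|x^i\|^2 \mid \Omega]} = \sqrt{|\Omega^i|\, n/kp} \le \sqrt{3/2}. \tag{6.36}
\]
Hence, for $\Omega \in \mathcal{O}$,
\[
\mathbb{P}\left[ \|x^i\| > 2 \mid \Omega \right] \le \mathbb{P}\left[ f(v^i) > \mathbb{E}[f(v^i) \mid \Omega] + (2 - \sqrt{3/2}) \mid \Omega \right] \le \exp\left( -\frac{kp}{4n} \right), \tag{6.37}
\]
where we have used that $(2 - \sqrt{3/2})^2/2 \approx 0.3005 > 1/4$ to simplify the constant.

Now, fix $i' \ne i$. We apply exactly the same reasoning to
\[
\|x^i P_{\Omega^{i'}}\| = \|v^i P_{\Omega^i} P_{\Omega^{i'}}\| = \|v^i P_{\Omega^i \cap \Omega^{i'}}\| \doteq g(v^i).
\]
Again, $g(\cdot)$ is a 1-Lipschitz function of $v^i$. Furthermore, for $\Omega \in \mathcal{O}$, $|\Omega^i \cap \Omega^{i'}| \le 3pk^2/2n^2$, and
\[
\mathbb{E}[g(v^i) \mid \Omega] \le \sqrt{3k/2n}. \tag{6.38}
\]
Again,
\[
\mathbb{P}\left[ g(v^i) > 2\sqrt{k/n} \mid \Omega \right] \le \mathbb{P}\left[ g(v^i) > \mathbb{E}[g(v^i) \mid \Omega] + (2 - \sqrt{3/2})\sqrt{k/n} \mid \Omega \right] \le \exp\left( -\frac{pk^2}{4n^2} \right).
\]
A union bound over all $n$ choices of $i$ in (6.37) and all $n(n-1)$ ordered pairs $(i, i')$ completes the proof. □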
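The two exponents in this proof come from plugging a deviation and a variance into (6.35). A minimal sketch of that arithmetic follows; the helper names and the sample sizes are hypothetical and serve only to illustrate the constant $(2 - \sqrt{3/2})^2/2 > 1/4$.

```python
import math


def concentration_exponent(deviation, sigma2):
    """Exponent t^2 / (2 sigma^2) in the Gaussian concentration bound (6.35)."""
    return deviation**2 / (2.0 * sigma2)


def row_norm_exponents(n, k, p):
    """Constant and exponents for the row-norm and restricted-row events."""
    gap = 2.0 - math.sqrt(1.5)      # threshold 2 minus the mean bound sqrt(3/2)
    sigma2 = n / (k * p)            # variance of the nonzero coefficients
    norm_exp = concentration_exponent(gap, sigma2)
    restricted_exp = concentration_exponent(gap * math.sqrt(k / n), sigma2)
    return gap**2 / 2.0, norm_exp, restricted_exp


# Hypothetical sizes, for illustration only.
const, norm_exp, restricted_exp = row_norm_exponents(n=50, k=4, p=1000)
print(const, norm_exp, restricted_exp)
```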

6.4 Proof of Lemma 6.3

Proof. We will show the calculations for $i = 1$. An identical argument works for $i = 2, \dots, n$ as well. Since $A$ is incoherent, $A^* P_1 A = A^*A - A^*A_1A_1^*A \approx I - e_1e_1^*$, and so we set
\[
\Delta = A^* P_1 A - (I - e_1 e_1^*) \in \mathbb{R}^{n \times n}.
\]
We notice that since $\mu(A) \le 1$,
\[
\|\Delta\|_\infty \le \|A^*A - I\|_\infty + \|A^*A_1A_1^*A - e_1e_1^*\|_\infty \le 2\mu(A).
\]
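As a numerical sanity check of the bound on $\|\Delta\|_\infty$, the sketch below builds $\Delta$ for a random dictionary with unit-norm columns. It assumes $P_1 = I - A_1A_1^*$, as implied by the identity $A^*P_1A = A^*A - A^*A_1A_1^*A$ above; the sizes, seed, and function name are illustrative assumptions.

```python
import numpy as np


def coherence_perturbation(A):
    """Return (Delta, mu) for a matrix A whose columns are assumed unit-norm."""
    m, n = A.shape
    G = A.T @ A                                  # Gram matrix A^* A
    mu = np.max(np.abs(G - np.eye(n)))           # mutual coherence mu(A)
    a1 = A[:, [0]]
    P1 = np.eye(m) - a1 @ a1.T                   # projector onto the complement of A_1
    E11 = np.zeros((n, n))
    E11[0, 0] = 1.0                              # e_1 e_1^*
    Delta = A.T @ P1 @ A - (np.eye(n) - E11)
    return Delta, mu


rng = np.random.default_rng(0)
A = rng.standard_normal((40, 60))
A /= np.linalg.norm(A, axis=0)                   # normalize the columns
Delta, mu = coherence_perturbation(A)
print(np.max(np.abs(Delta)), 2 * mu, Delta[0, 0])
```

The $(1,1)$ entry of $\Delta$ vanishes, which is the fact used later in the proof to show $T_1 = 0$.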

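Write
\begin{align}
P_\Omega \left( x^{1*}x^1 \otimes A^*P_1A \right) P_\Omega &= P_\Omega \left( x^{1*}x^1 \otimes (I - e_1e_1^* + \Delta) \right) P_\Omega \tag{6.39} \\
&= P_\Omega \left( x^{1*}x^1 \otimes (I - e_1e_1^*) \right) P_\Omega + P_\Omega \left( x^{1*}x^1 \otimes \Delta \right) P_\Omega. \tag{6.40}
\end{align}
We handle the two terms individually. For the first term,
\[
L = P_\Omega \left( x^{1*}x^1 \otimes (I - e_1e_1^*) \right) P_\Omega \in \mathbb{R}^{np \times np}, \tag{6.41}
\]
we let $\mathcal{L} : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ be the equivalent linear operator such that for all $Q \in \mathbb{R}^{n \times p}$,
\[
\mathrm{vec}[\mathcal{L}[Q]] = L\, \mathrm{vec}[Q]. \tag{6.42}
\]
The norm of $L$ as a linear operator from $\ell^2$ to $\ell^2$ is the same as the induced norm on $\mathcal{L}$:
\[
\|L\| = \|\mathcal{L}\| = \sup_{Q \ne 0} \frac{\|\mathcal{L}[Q]\|_F}{\|Q\|_F}. \tag{6.43}
\]
From (6.42) and the relationship $\mathrm{vec}[PQR] = (R^* \otimes P)\, \mathrm{vec}[Q]$, the operator $\mathcal{L}[Q]$ is given by
\[
\mathcal{L}[Q] = \mathcal{P}_\Omega\left[ (I - e_1e_1^*)\, \mathcal{P}_\Omega[Q]\, x^{1*}x^1 \right]. \tag{6.44}
\]
For any $H \in \mathbb{R}^{n \times p}$, we can express the projection $\mathcal{P}_\Omega$ of $H$ onto $\Omega$ via its action on the rows of $H$:
\[
\mathcal{P}_\Omega[H] = \sum_{a=1}^n e_a e_a^* H P_{\Omega^a}. \tag{6.45}
\]
Applying this expression twice, (6.44) becomes
\begin{align}
\mathcal{L}[Q] &= \sum_{a=1}^n e_a e_a^* (I - e_1e_1^*)\, \mathcal{P}_\Omega[Q]\, x^{1*}x^1 P_{\Omega^a} \nonumber \\
&= \sum_{a=2}^n e_a e_a^* \Big( \sum_{b=1}^n e_b e_b^* Q P_{\Omega^b} \Big) x^{1*}x^1 P_{\Omega^a} \nonumber \\
&= \sum_{a=2}^n e_a e_a^* Q P_{\Omega^a} x^{1*}x^1 P_{\Omega^a}. \tag{6.46}
\end{align}
Since the $a$-th summand occupies only the $a$-th row, we can write $\|\mathcal{L}[Q]\|_F^2$ as the sum of the squared $\ell^2$ norms of the terms in the above expression:
\[
\|\mathcal{L}[Q]\|_F^2 = \sum_{a=2}^n \left\| e_a^* Q P_{\Omega^a} x^{1*}x^1 P_{\Omega^a} \right\|^2 \le \sum_{a=2}^n \|e_a^*Q\|^2 \|x^1 P_{\Omega^a}\|^4 \le \frac{16k^2}{n^2} \|Q\|_F^2, \tag{6.47}
\]
where above we have used that on $\mathcal{E}_1$, $\|x^1 P_{\Omega^a}\| \le 2\sqrt{k/n}$ for every $a \ne 1$. We conclude that
\[
\|L\| \le 4k/n. \tag{6.48}
\]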
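The passage from the Kronecker-product matrix $L$ to the operator $\mathcal{L}$ rests on the identity $\mathrm{vec}[PQR] = (R^* \otimes P)\,\mathrm{vec}[Q]$. The following sketch checks it numerically for real matrices, assuming $\mathrm{vec}$ stacks columns (column-major order); the matrix sizes and seed are arbitrary choices made here.

```python
import numpy as np


def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.reshape(-1, order="F")


rng = np.random.default_rng(1)
P = rng.standard_normal((3, 4))
Q = rng.standard_normal((4, 5))
R = rng.standard_normal((5, 2))

lhs = vec(P @ Q @ R)
rhs = np.kron(R.T, P) @ vec(Q)       # for real R, the adjoint R^* is the transpose
print(np.max(np.abs(lhs - rhs)))
```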



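We next address the second term in (6.40). Define
\[
W = P_\Omega \left( x^{1*}x^1 \otimes \Delta \right) P_\Omega \in \mathbb{R}^{np \times np}. \tag{6.49}
\]
We need to bound the operator norm of $W$. As above, we associate a linear map $\mathcal{W} : \mathbb{R}^{n \times p} \to \mathbb{R}^{n \times p}$ given by
\[
\mathcal{W}[Q] = \mathcal{P}_\Omega\left[ \Delta\, \mathcal{P}_\Omega[Q]\, x^{1*}x^1 \right] = \sum_{a,b=1}^n e_a e_a^* \Delta e_b e_b^* Q P_{\Omega^b} x^{1*}x^1 P_{\Omega^a}. \tag{6.50}
\]
In the above expression, terms for which $a, b \ne 1$ will be easily handled, since on $\mathcal{E}_1$, $\|x^1 P_{\Omega^a}\| \le 2\sqrt{k/n}$ for every $a \ne 1$. We therefore break the above summation into four terms:
\begin{align}
T_1 &= e_1 e_1^* \Delta e_1 e_1^* Q P_{\Omega^1} x^{1*}x^1 P_{\Omega^1}, \tag{6.51} \\
T_2 &= \sum_{b=2}^n e_1 e_1^* \Delta e_b e_b^* Q P_{\Omega^b} x^{1*}x^1 P_{\Omega^1}, \tag{6.52} \\
T_3 &= \sum_{a=2}^n e_a e_a^* \Delta e_1 e_1^* Q P_{\Omega^1} x^{1*}x^1 P_{\Omega^a}, \tag{6.53} \\
\text{and} \quad T_4 &= \sum_{a,b=2}^n e_a e_a^* \Delta e_b e_b^* Q P_{\Omega^b} x^{1*}x^1 P_{\Omega^a}. \tag{6.54}
\end{align}
In terms of these four quantities,
\[
\mathcal{W}[Q] = T_1 + T_2 + T_3 + T_4. \tag{6.55}
\]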



Now, $e_1^* \Delta e_1 = A_1^*A_1 - (A_1^*A_1)^2 = 0$, since $A_1^*A_1 = 1$. Plugging $e_1^* \Delta e_1 = 0$ into (6.51), we have $T_1 = 0$. Below, we show that on $\mathcal{E}_1$, the following bounds on the terms $T_2, T_3, T_4$ hold:
\begin{align}
\|T_2\|_F &\le 8\mu(A)\sqrt{k}\, \|Q\|_F, \tag{6.56} \\
\|T_3\|_F &\le 8\mu(A)\sqrt{k}\, \|Q\|_F, \tag{6.57} \\
\|T_4\|_F &\le 8\mu(A)\, k\, \|Q\|_F. \tag{6.58}
\end{align}
Hence, on $\mathcal{E}_1$,
\[
\|\mathcal{W}[Q]\|_F \le \left( 16\mu(A)\sqrt{k} + 8\mu(A)k \right) \|Q\|_F, \tag{6.59}
\]
and so $\|W\| \le 16\mu(A)\sqrt{k} + 8\mu(A)k \le 24k\mu(A)$. Combining this observation with (6.48) gives the desired result, (6.17). We just have to establish the three inequalities (6.56)-(6.58). Paragraphs (i)-(iii) below do this.



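(i) Establishing (6.56). For the term $T_2$ defined in (6.52), notice
\begin{align}
\|T_2\|_F &\le \|e_1^*\Delta\|_2 \, \Big\| \sum_{b=2}^n e_b e_b^* Q P_{\Omega^b} x^{1*} \Big\|_2 \, \|x^1\|_2 \tag{6.60} \\
&\le 2\, \|e_1^*\Delta\|_2 \, \Big\| \sum_{b=2}^n e_b e_b^* Q P_{\Omega^b} x^{1*} \Big\|_2 \tag{6.61} \\
&\le 4\sqrt{n}\, \mu(A) \, \Big\| \sum_{b=2}^n e_b e_b^* Q P_{\Omega^b} x^{1*} \Big\|_2. \tag{6.62}
\end{align}
Above, we have used: the Cauchy-Schwarz inequality in (6.60), the bound $\|x^1\| \le 2$ on $\mathcal{E}_1$ in (6.61), and the bound $\|\Delta\|_\infty \le 2\mu(A)$ in (6.62). Now,
\begin{align}
\Big\| \sum_{b=2}^n e_b e_b^* Q P_{\Omega^b} x^{1*} \Big\|_2^2 &= \sum_{b=2}^n \left( e_b^* Q P_{\Omega^b} x^{1*} \right)^2 \nonumber \\
&\le \sum_{b=2}^n \|e_b^*Q\|_2^2 \, \|x^1 P_{\Omega^b}\|_2^2 \nonumber \\
&\le 4k\|Q\|_F^2/n. \tag{6.63}
\end{align}
Combining (6.62) and (6.63) establishes (6.56).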



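(ii) Establishing (6.57). For the term $T_3$ defined in (6.53), we have
\begin{align}
\|T_3\|_F^2 &\le 4\mu^2(A) \sum_{a=2}^n \|e_1^*Q\|^2 \, \|x^1\|^2 \, \|x^1 P_{\Omega^a}\|^2 \tag{6.64} \\
&\le 4\mu^2(A) \times 4 \times 4(k/n) \times (n-1)\, \|e_1^*Q\|^2. \tag{6.65}
\end{align}
In (6.64) we have used the bound $\|\Delta\|_\infty \le 2\mu(A)$ and the Cauchy-Schwarz inequality; in (6.65) we have plugged in the bounds $\|x^1\| \le 2$ and $\|x^1 P_{\Omega^a}\| \le 2\sqrt{k/n}$. Finally, conservatively bounding $\|e_1^*Q\|$ by $\|Q\|_F$ and taking the square root of both sides gives the desired result, (6.57).

(iii) Establishing (6.58). The final term, $T_4$, requires a bit more manipulation. Expressing $\|T_4\|_F^2$ as a sum of squared $\ell^2$ norms of the rows of $T_4$ gives
\begin{align}
\|T_4\|_F^2 &= \sum_{a=2}^n \Big\| \sum_{b=2}^n e_a^* \Delta e_b e_b^* Q P_{\Omega^b} x^{1*} x^1 P_{\Omega^a} \Big\|_2^2 \tag{6.66} \\
&\le 4(k/n) \times \sum_{a=2}^n \Big( e_a^* \Delta \sum_{b=2}^n e_b e_b^* Q P_{\Omega^b} x^{1*} \Big)^2 \tag{6.67} \\
&\le 4(k/n) \times \sum_{a=2}^n \|e_a^*\Delta\|_2^2 \, \Big\| \sum_{b=2}^n e_b e_b^* Q P_{\Omega^b} x^{1*} \Big\|_2^2 \tag{6.68} \\
&\le 4(k/n) \times \|\Delta\|_F^2 \times \Big\| \sum_{b=2}^n e_b e_b^* Q P_{\Omega^b} x^{1*} \Big\|_2^2 \tag{6.69} \\
&\le 4(k/n) \times n^2 \|\Delta\|_\infty^2 \times \sum_{b=2}^n \left( e_b^* Q P_{\Omega^b} x^{1*} \right)^2 \tag{6.70} \\
&\le 16kn\mu^2(A) \times \sum_{b=2}^n \|e_b^*Q\|_2^2 \, \|P_{\Omega^b} x^{1*}\|_2^2 \tag{6.71} \\
&\le 64k^2\mu^2(A) \times \sum_{b=2}^n \|e_b^*Q\|_2^2. \tag{6.72}
\end{align}
Bounding the summation by $\|Q\|_F^2$ and taking square roots gives (6.58). This completes the proof of Lemma 6.3. □

A Conditioning of the Linearized Subproblem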



In Section 2, we sketched an argument for why the restricted isometry property would not be useful for analyzing the linearized subproblem in dictionary learning. We demonstrated a perturbation pair $(\Delta_A, \Delta_X)$ lying in the nullspace of the linear constraints, such that $\Delta_X$ has the same sparsity as $X$. The reader might wonder whether this analysis can be made rigorous, and justifiably so, since the classical RIP analysis pertains to an $\ell^1$-minimization problem in which all of the variables are weighted equally, whereas in the linearized subproblem, $\Delta_A$ is not penalized. In this section, we show a more precise sense in which the RIP is violated.

To see this, notice that whenever $X$ has full row rank $n$, we can write
\[
\Delta_A = -A\Delta_X X^*(XX^*)^{-1}. \tag{A.1}
\]
Eliminating $\Delta_A$ yields the equivalent problem
\[
\text{minimize } \|X + \Delta_X\|_1 \quad \text{subject to} \quad A\Delta_X(I - P_X) = 0, \quad \langle A_i, A\Delta_X X^*(XX^*)^{-1} e_i \rangle = 0 \;\; \forall i, \tag{A.2}
\]
where $P_X = X^*(XX^*)^{-1}X$. If we make the substitution $Z = X + \Delta_X$, we obtain an equivalent problem,
\begin{align}
\text{minimize } \|Z\|_1 \quad \text{subject to} \quad & AZ(I - P_X) = AX(I - P_X), \nonumber \\
& \langle A_i, AZX^*(XX^*)^{-1} e_i \rangle = \langle A_i, AXX^*(XX^*)^{-1} e_i \rangle \;\; \forall i. \tag{A.3}
\end{align}
This is an equality constrained $\ell^1$ norm minimization problem. We wish to know whether $Z = X$ is the unique optimal solution (corresponding to $\Delta_X = 0$ being uniquely optimal for the original problem). Let $\pi$ be any permutation of $[n]$ with no fixed point, and let $\Pi \in \mathbb{R}^{n \times n}$ be the corresponding permutation matrix. Then it is easy to verify that if we set $H = \Pi X \in \mathbb{R}^{n \times p}$, $AH(I - P_X) = 0$, and
\[
\left| \langle A_i, AHX^*(XX^*)^{-1} e_i \rangle \right| \le \mu(A).
\]
It is also obvious that $H$ has the same number of nonzeros, and in fact the same number of nonzeros in each column, as the desired solution $X$. Hence, if $\mu(A) = 0$ (i.e., $A$ is an orthonormal basis), $H$ lies in the nullspace of the linear constraints in (A.3), and so the RIP cannot hold for this problem. Hence, in what is arguably the best possible case for recovering $A$ and $X$, the linearized subproblem cannot have the RIP. Moreover, applying linear operations to the constraints in (A.3) cannot help, since $H$ lies strictly in the nullspace of these constraints. When $\mu(A)$ is nonzero but small, as in our above problem, $H$ still lies very near the nullspace, and the RIP does not hold with any useful constant for (A.3). Of course, strictly speaking, when $\mu(A) > 0$ this argument does not preclude the possibility that there is some linear transformation of the equality constraints in (A.3) that does have the RIP.
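The nullspace claim for $H = \Pi X$ is easy to confirm numerically in the orthonormal case $\mu(A) = 0$. The sketch below is an illustration under assumptions made here: a random orthonormal $A$, a random $k$-sparse $X$, and a cyclic shift as the fixed-point-free permutation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 6, 40, 2

# Orthonormal dictionary, so mu(A) = 0.
A, _ = np.linalg.qr(rng.standard_normal((n, n)))

# Random k-sparse coefficient columns; X is assumed to have full row rank.
X = np.zeros((n, p))
for j in range(p):
    X[rng.choice(n, size=k, replace=False), j] = rng.standard_normal(k)

Pi = np.roll(np.eye(n), 1, axis=0)     # cyclic shift: a permutation with no fixed point
H = Pi @ X

XXt_inv = np.linalg.inv(X @ X.T)
P_X = X.T @ XXt_inv @ X

first = A @ H @ (np.eye(p) - P_X)                           # A H (I - P_X)
second = np.einsum("mi,mi->i", A, A @ H @ X.T @ XXt_inv)    # <A_i, A H X^*(XX^*)^{-1} e_i>
same_column_sparsity = np.array_equal(
    np.count_nonzero(H, axis=0), np.count_nonzero(X, axis=0)
)
print(np.max(np.abs(first)), np.max(np.abs(second)), same_column_sparsity)
```

Both constraint residuals vanish up to rounding, while $H$ is as sparse as $X$ column by column.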



B Technical Tools 



In this section, we quote two results used in our arguments. The first, which plays a key role, is the matrix Chernoff bound of Tropp [Tro10]. This convenient and powerful result builds on ideas introduced by Ahlswede and Winter [AW02].

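Theorem B.1 (Matrix Chernoff Bound, [Tro10] Theorem 2.5). Let $M_1, \dots, M_n$ be a finite sequence of independent random positive-semidefinite matrices of dimension $d$. Suppose that for each $M_i$, $\lambda_{\max}(M_i) \le B$ almost surely. Set $\mu_{\min} = \lambda_{\min}\left( \sum_i \mathbb{E}[M_i] \right)$ and $\mu_{\max} = \lambda_{\max}\left( \sum_i \mathbb{E}[M_i] \right)$. Then the following two bounds hold:
\begin{align}
\mathbb{P}\Big[ \lambda_{\min}\Big( \sum_i M_i \Big) \le t\, \mu_{\min} \Big] &\le d \exp\left( -(1-t)^2 \mu_{\min} / 2B \right), \quad \forall t \in [0, 1), \tag{B.1} \\
\mathbb{P}\Big[ \lambda_{\max}\Big( \sum_i M_i \Big) \ge (1+t)\, \mu_{\max} \Big] &\le d \left[ \frac{e^t}{(1+t)^{1+t}} \right]^{\mu_{\max}/B}, \quad \forall t \ge 0. \tag{B.2}
\end{align}
Two simplifications of the upper tail are useful:
\[
\mathbb{P}\Big[ \Big\| \sum_i M_i \Big\| \ge (1+t)\, \mu_{\max} \Big] \le d \exp\left( -t^2 \mu_{\max} / 4B \right), \quad \forall t \in [0, 1], \tag{B.3}
\]
and
\[
\mathbb{P}\Big[ \Big\| \sum_i M_i \Big\| \ge t\, \mu_{\max} \Big] \le d \left( \frac{e}{t} \right)^{t \mu_{\max}/B}, \quad \forall t \ge e. \tag{B.4}
\]
The second is given in [Tro10], while the first follows from (B.2) and the inequality $t - (1+t)\log(1+t) \le -t^2/4$, which by convexity (or calculus) can be shown to hold on $[0, 1]$.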
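The scalar inequality behind (B.3) can be checked on a fine grid. This is a numerical illustration rather than a proof, and the helper name is introduced here for convenience.

```python
import numpy as np


def chernoff_gap(t):
    """-t^2/4 minus (t - (1 + t) log(1 + t)); nonnegative where the inequality holds."""
    return -t**2 / 4.0 - (t - (1.0 + t) * np.log1p(t))


grid = np.linspace(0.0, 1.0, 10001)
gap = chernoff_gap(grid)
print(gap.min(), gap[-1])
```

The gap is zero at $t = 0$ and grows to $2\log 2 - 5/4$ at $t = 1$, so the constant $1/4$ in (B.3) is not tight on this interval.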

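The second result we quote here is the classical Kahane-Khintchine inequality, here with constant $1/\sqrt{2}$ found by Latała and Oleszkiewicz [LO94]:

Theorem B.2 (Kahane-Khintchine Inequality [Kah64], [LO94] Theorem 1). Let $\sigma_1, \dots, \sigma_n$ be an iid sequence of Rademacher random variables (i.e., variables that take on $\pm 1$ with equal probability), and let $x_1, \dots, x_n$ be a fixed sequence of vectors in a normed space $V$. Then
\[
\frac{1}{\sqrt{2}} \left( \mathbb{E} \Big\| \sum_{i=1}^n \sigma_i x_i \Big\|^2 \right)^{1/2} \le \mathbb{E} \Big\| \sum_{i=1}^n \sigma_i x_i \Big\| \le \left( \mathbb{E} \Big\| \sum_{i=1}^n \sigma_i x_i \Big\|^2 \right)^{1/2}. \tag{B.5}
\]
This result has the following useful consequence:

Corollary B.3. Let $M \in \mathbb{R}^{m \times n}$ be any fixed matrix, and $v \in \mathbb{R}^n$ be an iid $\mathcal{N}(0, \sigma^2)$ vector. Then
\[
\frac{\sigma}{\sqrt{2}} \|M\|_F \le \mathbb{E}\left[ \|Mv\|_2 \right] \le \sigma \|M\|_F. \tag{B.6}
\]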
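A quick Monte Carlo experiment illustrates the two-sided bound (B.6). The matrix size, the value of $\sigma$, the seed, and the sample count below are arbitrary assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.7
M = rng.standard_normal((5, 8))                      # any fixed matrix

v = sigma * rng.standard_normal((8, 200_000))        # columns are iid N(0, sigma^2) vectors
mean_norm = np.linalg.norm(M @ v, axis=0).mean()     # Monte Carlo estimate of E||Mv||_2

fro = np.linalg.norm(M, "fro")
lower, upper = sigma * fro / np.sqrt(2.0), sigma * fro
print(lower, mean_norm, upper)
```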



C Consequences of Incoherence 

In this section, we assemble several useful consequences of the assumption that $A$ has low mutual coherence. All of the following bounds are well known [Fuc04]; we record their statements and (very simple) proofs here for completeness. As above, let $A \in \mathbb{R}^{m \times n}$ be a matrix with unit norm columns and mutual coherence $\mu(A)$. A bound on the mutual coherence immediately implies a bound on the norm of $A$. Set $\Delta = A^*A - I$. Then
\[
\|A\|^2 = \|A^*A\| = \|I + \Delta\| \le 1 + \|\Delta\| \le 1 + \|\Delta\|_F \le 1 + n\|\Delta\|_\infty = 1 + n\mu(A). \tag{C.1}
\]
For submatrices of $A$, tighter bounds can be obtained in a similar manner: if $L \in \binom{[n]}{k}$, the same argument shows
\[
\|A_L\|^2 = \|A_L^*A_L\| \le 1 + k\mu(A). \tag{C.2}
\]
Similarly, via eigenvalue perturbation bounds,
\[
\lambda_{\min}(A_L^*A_L) \ge 1 - k\mu(A). \tag{C.3}
\]
In particular, if we assume that $k\mu(A) < 1/2$, we have
\[
\|(A_L^*A_L)^{-1}\| \le 2. \tag{C.4}
\]
We can obtain a tighter result by using the Neumann series representation of the inverse. Suppose that $k\mu(A) < 1/2$, and write $A_L^*A_L = I + H$, and note that $\|H\|_F \le k\mu(A)$. Then using
\[
(A_L^*A_L)^{-1} = \sum_{t=0}^\infty (-1)^t H^t, \tag{C.5}
\]
we have
\[
\|(A_L^*A_L)^{-1} - I\|_F = \Big\| \sum_{t=1}^\infty (-H)^t \Big\|_F \le \sum_{t=1}^\infty \|H\|_F^t \le \frac{k\mu(A)}{1 - k\mu(A)} < 2k\mu(A). \tag{C.6}
\]
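These bounds can be exercised on a concrete incoherent dictionary. The sketch below uses the concatenation of the identity and a normalized Hadamard matrix, which has $\mu(A) = 1/\sqrt{m}$; this choice of dictionary, the sizes, and the random sampling of supports $L$ are assumptions made here for illustration only.

```python
import numpy as np


def hadamard(m):
    """Sylvester construction of an m x m Hadamard matrix (m a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return H


rng = np.random.default_rng(4)
m, k = 64, 3
A = np.hstack([np.eye(m), hadamard(m) / np.sqrt(m)])   # unit-norm columns, n = 2m
n = A.shape[1]
mu = np.max(np.abs(A.T @ A - np.eye(n)))               # mutual coherence
kmu = k * mu

checks = []
for _ in range(500):
    L = rng.choice(n, size=k, replace=False)           # a random support of size k
    G = A[:, L].T @ A[:, L]                            # A_L^* A_L
    eig = np.linalg.eigvalsh(G)                        # eigenvalues in ascending order
    neumann = np.linalg.norm(np.linalg.inv(G) - np.eye(k), "fro")
    checks.append((
        eig[-1] <= 1 + kmu,                            # (C.2)
        eig[0] >= 1 - kmu,                             # (C.3)
        1.0 / eig[0] <= 2.0,                           # (C.4)
        neumann <= kmu / (1 - kmu),                    # (C.6)
    ))
all_hold = all(all(c) for c in checks)
print(mu, kmu, all_hold, np.linalg.norm(A, 2) ** 2, 1 + n * mu)
```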
D Local Properties



In this section, we prove Lemma 3.1, which reduces the question of whether $x$ is a local optimum of $f$ over $\mathcal{M}$ to one of whether $\delta = 0$ minimizes $f(x + \delta)$ over $T_x\mathcal{M}$. The reasoning behind this lemma is simple. The function $f(A, X) = \|X\|_1$ is Lipschitz (its Lipschitz constant $L$ is at most $\sqrt{np}$). At the same time, $f$ is a polyhedral semi-norm. This means that the (unbounded) unit ball of $f$, $B = \{x \mid f(x) \le 1\}$, is a polyhedral set. Let $r = \min\{f(x + \delta) \mid \delta \in T_x\mathcal{M}\}$. The set of optimal $\delta$ corresponds precisely to the points $(-x + rB) \cap T_x\mathcal{M}$. In particular, if the optimizer is unique, then $rB$ intersects $x + T_x\mathcal{M}$ at a single point. Since $B$ is polyhedral, this intersection is "sharp": $f$ increases linearly as we move away from the optimal point. This property, sometimes referred to in the optimization literature as strong uniqueness, implies that higher order terms due to the curvature of $\mathcal{M}$ are negligible; if $\delta = 0$ is a unique optimum over the tangent space, $x$ is locally optimal over $\mathcal{M}$. We now formally prove Lemma 3.1.



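Proof. For any $\delta \in T_x\mathcal{M}$, let $x_\delta : (-\varepsilon, \varepsilon) \to \mathcal{M}$ be the unique geodesic satisfying $x_\delta(0) = x$ and $\dot{x}_\delta(0) = \delta$. Then
\[
x_\delta(t) = x + t\delta + O(t^2). \tag{D.1}
\]
We first prove that optimality over the tangent space is necessary. Indeed, suppose there exists $\delta \in T_x\mathcal{M}$ with $f(x + \delta) < f(x)$. Set $\tau = f(x) - f(x + \delta) > 0$. By convexity, for $\eta \in [0, 1]$,
\[
f(x + \eta\delta) \le f(x) - \eta\tau. \tag{D.2}
\]
But,
\begin{align}
f(x_\delta(t)) &= f\big( x + \eta\delta + (x_\delta(t) - (x + \eta\delta)) \big) \le f(x + \eta\delta) + L\|x_\delta(t) - x - \eta\delta\|_2 \nonumber \\
&\le f(x) - \eta\tau + L\|(t - \eta)\delta\|_2 + O(Lt^2). \tag{D.3}
\end{align}
When $t$ is sufficiently small and we let $\eta = t$, this value is strictly smaller than $f(x)$.

Conversely, suppose that $\delta = 0$ is the unique minimizer of $f(x + \delta)$ over $\delta \in T_x\mathcal{M}$. We will show that this minimizer is strongly unique (see e.g., [JO80]), i.e., $\exists\, \beta > 0$ such that
\[
f(x + \delta) \ge f(x) + \beta\|\delta\| \quad \forall \delta \in T_x\mathcal{M}. \tag{D.4}
\]
To see this, notice that if we write $x = (A, X)$ and $\delta = (\Delta_A, \Delta_X)$, then $f(x) = \|X\|_1$. Hence, if we set $r_0 = \min\{|X_{ij}| \mid X_{ij} \ne 0\} > 0$, whenever $\|\Delta_X\|_\infty < r_0$ and $t < 1$, we have
\begin{align*}
\|X + t\Delta_X\|_1 &= \|X\|_1 + t\langle \Sigma, \Delta_X \rangle + t\|\mathcal{P}_{\Omega^c}\Delta_X\|_1 \\
&= \|X\|_1 + t\langle \Sigma + \mathrm{sign}(\mathcal{P}_{\Omega^c}\Delta_X), \Delta_X \rangle.
\end{align*}
Set $\beta(\delta) = \langle \Sigma + \mathrm{sign}(\mathcal{P}_{\Omega^c}\Delta_X), \Delta_X \rangle$, and notice that $\beta$ is a continuous function of $\delta$. Let
\[
\beta^\star = \inf_{\delta \in T_x\mathcal{M},\; \|\delta\| = r_0/2} \beta(\delta) \ge 0. \tag{D.5}
\]
Then for all $\delta \in T_x\mathcal{M}$ with $\|\delta\| \le r_0/2$ we have
\[
f(x + \delta) \ge f(x) + (2\beta^\star/r_0)\|\delta\|. \tag{D.6}
\]
Moreover, by convexity of $f$, the same bound holds for all $\delta \in T_x\mathcal{M}$ (regardless of $\|\delta\|$). It remains to show that $\beta^\star$ is strictly larger than zero. On the contrary, suppose $\beta^\star = 0$. Since the infimum in (D.5) is taken over a compact set, it is achieved by some $\delta^\star \in T_x\mathcal{M}$. Hence, if $\beta^\star = 0$, $f(x + \delta^\star) = f(x)$, contradicting the uniqueness of the minimizer $\delta = 0$. This establishes (D.4).

Hence, continuing forward, we have
\[
f(x_\delta(\eta)) \ge f(x) + \eta\beta\|\delta\| - O(\beta\eta^2). \tag{D.7}
\]
For $\eta$ sufficiently small the right hand side is strictly greater than $f(x)$. □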
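The first-order expansion of $\|X + t\Delta_X\|_1$ used in the proof holds for any perturbation with $\|\Delta_X\|_\infty < r_0$, whether or not it lies in the tangent space. The sketch below checks it for a random sparse $X$ and an arbitrary small perturbation; all sizes, the sparsity level, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 12
X = rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.3)   # sparse X with support Omega
support = X != 0
r0 = np.min(np.abs(X[support]))                                 # smallest nonzero magnitude

D = rng.standard_normal((n, p))                                 # stands in for Delta_X
D *= 0.5 * r0 / np.max(np.abs(D))                               # enforce ||Delta_X||_inf < r0

Sigma = np.sign(X)                                              # sign pattern of X
off_support = np.where(support, 0.0, D)                         # P_{Omega^c} Delta_X

t = 0.8
lhs = np.abs(X + t * D).sum()
rhs = np.abs(X).sum() + t * np.sum((Sigma + np.sign(off_support)) * D)
print(lhs, rhs)
```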
E A Decoupling Lemma

The following technical lemma concerns the expected norm of the restriction $P_\Omega M P_\Omega$ of a matrix $M$ with zero diagonal to a random diagonal block. Its proof is an application of a well-known decoupling technique [LT91]. In particular, several steps are quite similar to manipulations in the proof of Proposition 2.1 of [Tro08], and are repeated here for completeness.

Lemma E.1. Fix $M \in \mathbb{R}^{n \times n}$ with all diagonal elements equal to zero. Let $\Omega \sim \mathrm{uni}\binom{[n]}{k}$ be a uniform random subset of size $k$. Then the following estimate holds:
\[
\mathbb{E}\left[ \|P_\Omega M P_\Omega\|_F \right] \le 16\sqrt{\frac{k}{n}}\; \mathbb{E}\left[ \|M P_\Omega\|_F \right]. \tag{E.1}
\]

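Proof. Let $\Lambda$ denote a diagonal matrix whose entries are iid Bernoulli random variables taking on value 1 with probability $k/n$. Let $k' = \mathrm{tr}[\Lambda]$ denote the number of nonzeros in $\Lambda$; $k'$ is a binomial random variable. Then
\begin{align}
\mathbb{E}\left[ \|\Lambda M \Lambda\|_F \right] &= \sum_{s=0}^n \mathbb{P}[k' = s]\; \mathbb{E}\left[ \|\Lambda M \Lambda\|_F \mid k' = s \right], \tag{E.2} \\
&\ge \sum_{s=k}^n \mathbb{P}[k' = s]\; \mathbb{E}\left[ \|\Lambda M \Lambda\|_F \mid k' = s \right]. \tag{E.3}
\end{align}
Now, conditioned on $k' = s$, the nonzeros on the diagonal of $\Lambda$ are distributed according to a uniform distribution on $\binom{[n]}{s}$. Furthermore, whenever $\Omega \subseteq \mathrm{support}(\Lambda)$, $\|P_\Omega M P_\Omega\|_F \le \|\Lambda M \Lambda\|_F$, since $\Omega$ restricts to a smaller submatrix. Hence,
\[
\forall s \ge k, \quad \mathbb{E}\left[ \|\Lambda M \Lambda\|_F \mid k' = s \right] \ge \mathbb{E}\left[ \|P_\Omega M P_\Omega\|_F \right]. \tag{E.4}
\]
Plugging into (E.3), and using that $k$ is a median of the binomial random variable $k'$, we have
\begin{align}
\mathbb{E}\left[ \|\Lambda M \Lambda\|_F \right] &\ge \sum_{s=k}^n \mathbb{P}[k' = s]\; \mathbb{E}\left[ \|P_\Omega M P_\Omega\|_F \right], \tag{E.5} \\
&= \mathbb{P}[k' \ge k]\; \mathbb{E}\left[ \|P_\Omega M P_\Omega\|_F \right], \tag{E.6} \\
&\ge \tfrac{1}{2}\, \mathbb{E}\left[ \|P_\Omega M P_\Omega\|_F \right]. \tag{E.7}
\end{align}
Hence, we have
\[
\mathbb{E}\left[ \|P_\Omega M P_\Omega\|_F \right] \le 2\, \mathbb{E}\left[ \|\Lambda M \Lambda\|_F \right]. \tag{E.8}
\]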
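Step (E.7) relies on $k$ being a median of a binomial variable with $n$ trials and success probability $k/n$, so that its upper tail at $k$ is at least $1/2$. A minimal exact check over a few hypothetical values of $n$ is sketched below; the function name is introduced here for illustration.

```python
from fractions import Fraction
from math import comb


def upper_tail(n, k):
    """Exact P[k' >= k] for k' ~ Binomial(n, k/n), using rational arithmetic."""
    q = Fraction(k, n)
    return sum(comb(n, s) * q**s * (1 - q) ** (n - s) for s in range(k, n + 1))


# Hypothetical dimensions; every integer k between 1 and n is tested.
tails = {(n, k): upper_tail(n, k) for n in (5, 20, 60) for k in range(1, n + 1)}
print(float(min(tails.values())))
```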

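Similar to [Tro08], for each $i, j$, let $M_{ij} \in \mathbb{R}^{n \times n}$ be a matrix whose $(i, j)$ entry is equal to the $(i, j)$ entry of $M$, and whose other entries are equal to zero. Write
\[
\mathbb{E}\left[ \|\Lambda M \Lambda\|_F \right] = \mathbb{E}\Big[ \Big\| \sum_{i > j} \Lambda_i \Lambda_j (M_{ij} + M_{ji}) \Big\|_F \Big]. \tag{E.9}
\]
Introduce an independent sequence of Bernoulli random variables $\eta_1, \dots, \eta_n$, each taking on value 1 with probability 1/2, and write
\begin{align}
\mathbb{E}\Big[ \Big\| \sum_{i > j} \Lambda_i \Lambda_j (M_{ij} + M_{ji}) \Big\|_F \Big] &= 2\, \mathbb{E}_\Lambda \Big[ \Big\| \mathbb{E}_\eta \Big[ \sum_{i > j} \big( \eta_i(1 - \eta_j) + \eta_j(1 - \eta_i) \big) \Lambda_i \Lambda_j (M_{ij} + M_{ji}) \Big] \Big\|_F \Big] \nonumber \\
&\le 2\, \mathbb{E}_\Lambda \mathbb{E}_\eta \Big[ \Big\| \sum_{i > j} \big( \eta_i(1 - \eta_j) + \eta_j(1 - \eta_i) \big) \Lambda_i \Lambda_j (M_{ij} + M_{ji}) \Big\|_F \Big], \tag{E.10} \\
&= 2\, \mathbb{E}_\eta \mathbb{E}_\Lambda \Big[ \Big\| \sum_{i > j} \big( \eta_i(1 - \eta_j) + \eta_j(1 - \eta_i) \big) \Lambda_i \Lambda_j (M_{ij} + M_{ji}) \Big\|_F \Big]. \tag{E.11}
\end{align}
In (E.10), we used Jensen's inequality to pull the expectation out of the norm. Now, there must be at least one sequence $\eta^\star$ for which the quantity inside $\mathbb{E}_\eta$ on the right hand side of (E.11) is no smaller than its expectation over $\eta$. Let $T \subseteq [n]$ be the indices of the nonzeros in $\eta^\star$, and let $T^c$ be its complement. Then combining (E.8) and (E.11), we have
\begin{align}
\mathbb{E}\left[ \|P_\Omega M P_\Omega\|_F \right] &\le 4\, \mathbb{E}_\Lambda \Big[ \Big\| \sum_{i > j} \big( \eta_i^\star(1 - \eta_j^\star) + \eta_j^\star(1 - \eta_i^\star) \big) \Lambda_i \Lambda_j (M_{ij} + M_{ji}) \Big\|_F \Big] \tag{E.12} \\
&= 4\, \mathbb{E}_\Lambda \Big[ \Big\| \sum_{i \in T,\, j \in T^c} \Lambda_i \Lambda_j (M_{ij} + M_{ji}) \Big\|_F \Big] \tag{E.13} \\
&\le 4\, \mathbb{E}_\Lambda \Big[ \Big\| \sum_{i \in T,\, j \in T^c} \Lambda_i \Lambda_j M_{ij} \Big\|_F \Big] + 4\, \mathbb{E}_\Lambda \Big[ \Big\| \sum_{i \in T,\, j \in T^c} \Lambda_i \Lambda_j M_{ji} \Big\|_F \Big]. \tag{E.14}
\end{align}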



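Now, let $\Lambda'$ be an independent copy of $\Lambda$. Then the above is equal to
\begin{align}
& 4\, \mathbb{E}_{\Lambda, \Lambda'} \Big[ \Big\| \sum_{i \in T,\, j \in T^c} \Lambda_i' \Lambda_j M_{ij} \Big\|_F \Big] + 4\, \mathbb{E}_{\Lambda, \Lambda'} \Big[ \Big\| \sum_{i \in T,\, j \in T^c} \Lambda_i \Lambda_j' M_{ji} \Big\|_F \Big] \tag{E.15} \\
&\qquad \le 8\, \mathbb{E}_{\Lambda, \Lambda'} \Big[ \Big\| \sum_{i, j} \Lambda_i' \Lambda_j M_{ij} \Big\|_F \Big] = 8\, \mathbb{E}_{\Lambda, \Lambda'} \left[ \|\Lambda' M \Lambda\|_F \right] \tag{E.16} \\
&\qquad \le 8\, \mathbb{E}_\Lambda \left( \mathbb{E}_{\Lambda'} \left[ \|\Lambda' M \Lambda\|_F^2 \right] \right)^{1/2} = 8\sqrt{\frac{k}{n}}\; \mathbb{E}_\Lambda \left[ \|M \Lambda\|_F \right]. \tag{E.17}
\end{align}
Above, we have used the fact that the Frobenius norm does not increase when a matrix is restricted to a subset of its elements to move from a summation over $(i, j) \in T \times T^c$ to a summation over all pairs $(i, j)$.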

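We now just have to move from the Bernoulli model back to the uniform model. For a fixed value $s$ of $k'$ (i.e., a fixed number of nonzeros in $\Lambda$), we can divide $\mathrm{support}(\Lambda)$ into $a = \lceil k'/k \rceil$ random subsets $S_1, \dots, S_a$ of size at most $k$. Conditioned on $k' = s$, the marginal distribution of each $S_i$ is uniform on $\binom{[n]}{|S_i|}$, and hence
\[
\mathbb{E}_\Lambda \left[ \|M P_{S_i}\|_F \mid k' = s \right] \le
\begin{cases}
\mathbb{E}_\Omega \left[ \|M P_\Omega\|_F \right] & (i - 1)k < s, \\
0 & \text{else.}
\end{cases} \tag{E.18}
\]
The condition on $i$ in the first line above simply implies that $S_i$ is nonempty. So,
\begin{align}
\mathbb{E}_\Lambda \left[ \|M \Lambda\|_F \right] &\le \mathbb{E}_\Lambda \Big[ \sum_i \|M P_{S_i}\|_F \Big] \tag{E.19} \\
&= \sum_{s=0}^n \sum_i \mathbb{E}_\Lambda \left[ \|M P_{S_i}\|_F \mid k' = s \right] \mathbb{P}[k' = s] \tag{E.20} \\
&\le \sum_{s=0}^n \lceil s/k \rceil\, \mathbb{P}[k' = s]\; \mathbb{E}_\Omega \left[ \|M P_\Omega\|_F \right] \tag{E.21} \\
&\le \mathbb{E}_\Omega \left[ \|M P_\Omega\|_F \right] \times \mathbb{E}[k'/k + 1] = 2\, \mathbb{E}_\Omega \left[ \|M P_\Omega\|_F \right]. \tag{E.22}
\end{align}
Combining (E.17) and (E.22) completes the proof. □

F Proof of Lemma 5.2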



We will establish Lemma 5.2 by applying the matrix Chernoff bound to the extreme eigenvalues of the sum of independent positive semidefinite matrices
\[
XX^* = \sum_j x_j x_j^*. \tag{F.1}
\]
A bit of care needs to be taken because the summands $x_j x_j^*$ have unbounded norm; we handle this by replacing them with a sequence of truncated terms $\bar{x}_j \bar{x}_j^*$ that are equivalent to $x_j x_j^*$ with very high probability.
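Proof. Set
\[
\bar{x}_j =
\begin{cases}
x_j & \|x_j\| \le (1 + \beta)\sqrt{n \log p / p}, \\
0 & \text{else,}
\end{cases} \tag{F.2}
\]
where $\beta > 0$ is a constant to be chosen later. It is not difficult to show¹ that for each $j$
\[
\mathbb{P}\left[ \|x_j\| > (1 + \beta)\sqrt{\frac{n \log p}{p}} \right] \le p^{-\beta^2/2}. \tag{F.4}
\]
So, $\max_j \|x_j\|$ is bounded by $(1 + \beta)\sqrt{n \log p / p}$ with probability at least $1 - p^{1 - \beta^2/2}$. Hence, with at least this probability $\bar{x}_j = x_j \;\forall j$, and $\sum_j \bar{x}_j \bar{x}_j^* = \sum_j x_j x_j^*$. Thanks to truncation, the following bound always holds:
\[
\|\bar{x}_j \bar{x}_j^*\| \le B \doteq \frac{(1 + \beta)^2\, n \log p}{p}. \tag{F.5}
\]
Since $x_j x_j^* \succeq \bar{x}_j \bar{x}_j^*$ always, $\mathbb{E}[x_j x_j^*] \succeq \mathbb{E}[\bar{x}_j \bar{x}_j^*]$, and so
\[
\mu_{\max} = \Big\| \mathbb{E}\Big[ \sum_j \bar{x}_j \bar{x}_j^* \Big] \Big\| \le \Big\| \mathbb{E}\Big[ \sum_j x_j x_j^* \Big] \Big\| = 1. \tag{F.6}
\]
Plugging in to the bound (B.3), we have
\[
\mathbb{P}\Big[ \lambda_{\max}\Big( \sum_j \bar{x}_j \bar{x}_j^* \Big) \ge 1 + t \Big] \le n \exp\left( -\frac{t^2 \mu_{\max}\, p}{4(1 + \beta)^2\, n \log p} \right). \tag{F.7}
\]
Notice that there is still a dependence on $\mu_{\max} \le 1$ in the exponent. This will be resolved by developing a lower bound on $\mu_{\min} \le \mu_{\max}$.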



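¹ Conditioned on $\Omega_j$, $\|x_j\| = \|P_{\Omega_j} v_j\| = f(v_j)$ is a 1-Lipschitz function of $v_j$. From Jensen's inequality, $\mathbb{E}[f(v_j)] \le \sqrt{\mathbb{E}[f^2(v_j)]} \le \sqrt{n/p}$. Hence, from Gaussian measure concentration,
\[
\mathbb{P}\left[ f(v_j) \ge \mathbb{E}[f(v_j) \mid \Omega_j] + t \mid \Omega_j \right] \le \exp\left( -\frac{p t^2}{2n} \right). \tag{F.3}
\]
We set $t = \beta\sqrt{n \log p / p}$ and then use the fact that the bound holds for all $\Omega_j$ to remove the conditioning, giving (F.4).

The smallest eigenvalue requires a bit more work, since it decreases under truncation:
\begin{align}
\mu_{\min} &= \lambda_{\min}\Big( \sum_j \mathbb{E}[\bar{x}_j \bar{x}_j^*] \Big) \tag{F.8} \\
&= \lambda_{\min}\Big( I - \sum_j \mathbb{E}[x_j x_j^* - \bar{x}_j \bar{x}_j^*] \Big) \tag{F.9} \\
&\ge 1 - \sum_j \mathbb{E}\left[ \|x_j x_j^* - \bar{x}_j \bar{x}_j^*\| \right] \tag{F.10} \\
&= 1 - \sum_j \mathbb{E}\left[ \|x_j x_j^*\|\, \mathbb{1}_{\|x_j\| > \sqrt{B}} \right] \tag{F.11} \\
&= 1 - \sum_j \mathbb{E}\left[ \|x_j\|_2^2\, \mathbb{1}_{\|x_j\| > \sqrt{B}} \right] \tag{F.12} \\
&\ge 1 - \sum_j \sqrt{\mathbb{E}[\|x_j\|_2^4]}\, \sqrt{\mathbb{E}\big[ (\mathbb{1}_{\|x_j\| > \sqrt{B}})^2 \big]} \tag{F.13} \\
&= 1 - p\, \sqrt{\mathbb{E}[\|x_1\|_2^4]}\, \sqrt{\mathbb{P}[\|x_1\| > \sqrt{B}]}. \nonumber
\end{align}
Above, in (F.10) we have used Jensen's inequality, while in (F.11) we have used the definition of $\bar{x}_j$. In (F.13) we have used the Cauchy-Schwarz inequality. The manipulations are completed using the fact that for $Z \sim \mathcal{N}(0, \sigma^2)$, $\mathbb{E}[Z^4] = 3\sigma^4$ to give the following bound on $\mathbb{E}\|x_1\|^4$:
\begin{align}
\mathbb{E}[\|x_1\|_2^4] &= \mathbb{E}\Big[ \Big( \sum_{a \in \Omega_1} V_a^2 \Big)^2 \Big] \tag{F.14} \\
&= \sum_{a, b \in \Omega_1} \mathbb{E}[V_a^2 V_b^2] = k(k-1)\sigma^4 + k\, \mathbb{E}[V_a^4]. \tag{F.15}
\end{align}
From the above, write $\mu_{\min} = 1 - g(p)$. Then Tropp's bound gives
\[
\mathbb{P}\Big[ \lambda_{\min}\Big( \sum_j \bar{x}_j \bar{x}_j^* \Big) \le 1 - t \Big] \le n \exp\left( -\frac{(t - g(p))^2\, p}{2(\beta + 1)^2\, n \log p} \right). \tag{F.16}
\]
For concreteness, we choose $\beta = 4$. Then provided $p > (Cn/t)^{1/4}$, $g(p) \le t/2 \le 1/2$, completing the proof. □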
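The fourth-moment computation used to control $\mu_{\min}$ can be illustrated by simulation. The sketch below draws the $k$ nonzero entries of one column as iid $\mathcal{N}(0, \sigma^2)$ variables and compares the empirical value of $\mathbb{E}\|x_1\|^4$ with $k(k-1)\sigma^4 + 3k\sigma^4$; the values of $k$ and $\sigma$, the seed, and the sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
k, sigma = 5, 0.4

# Each row holds the k nonzero entries of one simulated column.
V = sigma * rng.standard_normal((400_000, k))
fourth_moment = np.mean(np.sum(V**2, axis=1) ** 2)      # Monte Carlo estimate of E||x_1||^4

# k(k-1) cross terms of sigma^4 each, plus k diagonal terms with E[Z^4] = 3 sigma^4.
predicted = k * (k - 1) * sigma**4 + 3 * k * sigma**4
print(fourth_moment, predicted)
```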